1 Introduction

In recent years, much attention has been paid to regularized regression for high dimensional linear
models (e.g., [21], [23]). Meanwhile, a much smaller body of work has been devoted to such
regression for high dimensional generalized linear models [20]. Despite the impressive progress, the
full potential of regularized regression for models with underlying linear structures seems far from
being fully explored.

Regression is a type of optimization. In the current literature on high dimensional generalized linear
models, the target, or loss, functions being optimized have the form $\sum_i \gamma_i(X_i^\top u, Y_i)$, where $\gamma_i$ is a
known function, $X_i$ a high dimensional covariate, $u$ a parameter, and $Y_i$ a response variable. Two
possible extensions of the regression can be identified as follows. First, instead of one parameter
vector, a small number of parameter vectors may appear in a model, so that the loss functions
become $\gamma_i(X_i^\top u_1, \ldots, X_i^\top u_k, Y_i)$. Parameter estimation involving multiple linear combinations of
covariates has been considered at least in neuroscience, where multiple informative dimensions of
visual signals of very high dimension need to be estimated based on a relatively small amount of
data, so that the neural activity in a visual system may be better characterized [2]. The goal of
that effort is rather ambitious: to estimate the functional form of $\gamma_i$ nonparametrically along
with a few informative dimensions characterized by $u_1, \ldots, u_k$. However, it seems that a rigorous
development toward this goal is difficult using currently available statistical methods. A more modest
goal is to estimate $u_1, \ldots, u_k$ while having $\gamma_i$ fixed. For apparently more flexible loss functions
$\gamma_i(\theta, X_i^\top u_1, \ldots, X_i^\top u_k, Y_i)$, where $\theta$ is a parameter controlling the shape of $\gamma_i$, by adding auxiliary
covariates into $X_i$, one can reformulate them into $\tilde\gamma_i(Z_i^\top v_1, \ldots, Z_i^\top v_l, Y_i)$. Of course, the dimension
of $\theta$ has to be low. Once the dimension of $\theta$ gets high, the estimation becomes no less challenging
than the aforementioned nonparametric estimation.

Second, instead of both the covariates and response variables being observed, $X_i$ may be missing
and only $Y_i$ are observed. To be specific, suppose we wish to use regularized likelihood estimation.
Each loss function is then the logarithm of the marginal of $Y_i$ at parameter value $u$, which no longer
has the form $\gamma_i(X_i^\top u, Y_i)$. Parameter estimation with missing data is certainly of interest in its own
right. Naturally, one has to make more assumptions on the structure of the random object $(X_i, Y_i)$ in
order to estimate $u$. The issue is, provided such assumptions are made, whether regularized regression
can still work when $u$ is of high dimension.

To attack these two regression problems, we use a method in [10], which establishes estimation
precision by first obtaining certain stochastic Lipschitz continuity results for the total loss function
and then combining them with $\ell_1$ regularized regression (Lasso). Basically, if $L(u)$ denotes the empirical
total loss, with $u = (u_1, \ldots, u_k)$, then the so called stochastic Lipschitz continuity is concerned with
the upper tail behavior of the supremum of

$$\frac{|(L(u) - EL(u)) - (L(v) - EL(v))|}{\sum_{j \le k} \|u_j - v_j\|_1},$$

where $u_1, \ldots, u_k$ are allowed to vary over a certain domain of parameter values. Note that we are
interested in the fluctuation of the loss, i.e., $L(u) - EL(u)$, rather than the loss itself. If $v_1, \ldots, v_k$
are also allowed to vary over the domain, then by definition, the supremum is just the Lipschitz
coefficient of $L(u) - EL(u)$. With a little abuse of language, if $v_1, \ldots, v_k$ are fixed, the supremum
will be referred to as the local Lipschitz coefficient at $v$. As in the display, we shall always consider
stochastic Lipschitz continuity with respect to (wrt) the $\ell_1$ norm.

For linear models, stochastic Lipschitz continuity has already been recognized as a useful tool
to study high dimensional Lasso (cf. [4], [6] and references therein). The issue becomes significantly
more involved for the problems we consider. Our solution requires certain comparison results on
Rademacher complexity [14]. The topic of Rademacher complexity has recently generated quite
an amount of interest [1], [3], [12], [11], [22]. The results in these works are on processes of the type $\sum_i \varepsilon_i f_i(t_i)$,
where $\varepsilon_i$ are independent Rademacher variables, i.e., $\Pr\{\varepsilon_i = 1\} = \Pr\{\varepsilon_i = -1\} = 1/2$. Without
going into detail, the point is that $f_i$ are univariate, i.e., $t_i \in \mathbb{R}$. It turns out we need comparison
results involving multivariate $f_i$. However, it appears that such results are not yet available in the
literature. As a technical preparation, two such results will be given in Section 2, both having the
classical form as in [14]. As with the known univariate results, some of the results can be extended to symmetric integrable $\varepsilon_i$
that need not be identically distributed. This is potentially useful for dealing with stochastic Lipschitz
continuity involving unbounded noise terms [10], such as subgaussian ones that allow similar measure
concentration as bounded noise (cf. [13], p. 41). A detailed study on this, however, is beyond the
scope of the article.

Sections 3 and 4 deal with regression involving multiple linear combinations of covariates. First,
in Section 3 we use the multivariate comparison results in Section 2 to derive stochastic Lipschitz
continuity for loss functions of the form $\sum_{i \le N} \gamma_i(Z_i u, Y_i)$, where $Z_1, \ldots, Z_N$ are fixed matrices and
$Y_1, \ldots, Y_N$ are independent random variables. It is not hard to see that such loss functions include
$\sum_{i \le N} \gamma_i(X_i^\top u_1, \ldots, X_i^\top u_k, Y_i)$ as special cases, if each $Z_i$ is appropriately constructed from $X_i$. For
the parameter estimation considered in Section 4, local stochastic Lipschitz continuity is sufficient
for our need. However, (the usual) stochastic Lipschitz continuity is known in the context of linear
regression and is not difficult to establish following the method for local stochastic Lipschitz
continuity. For completeness, we shall give a result on stochastic Lipschitz continuity as well. In
Section 4 we apply the results in Section 3 to the Lasso estimator

$$\hat\theta = (\hat\theta_1, \ldots, \hat\theta_k) = \arg\min_{u \in D} \Big\{ L(u) + \lambda \sum_{j \le k} \|u_j\|_1 \Big\},$$

where $D$ is a domain of parameter values and $\lambda > 0$ is a tuning parameter. Compared to the case
where $k = 1$, the issue is that the Lasso only gives a comparison between $\sum_{j \le k} \|\hat\theta_j\|_1$ and $\sum_{j \le k} \|\theta_j\|_1$,
where $\theta_j$ are the true parameter values. However, it does not provide direct comparisons between
$\|\hat\theta_j\|_1$ and $\|\theta_j\|_1$ for individual $j \le k$, which are needed to bound the total $\ell_2$ error of $\hat\theta_j$. This issue
can be resolved by using an eigenvalue condition on the design matrix consisting of $X_i$ [3].

Section 5 deals with $\ell_1$ regularized likelihood estimation when the covariates are missing. Most
effort of this section is spent on establishing stochastic Lipschitz continuity for loss functions
expressed, roughly speaking, as $\sum_{i \le N} \ln \int f(x^\top u, Y_i)\, d\mu(x \mid Y_i)$. As can be expected, the logarithmic
and integral transformations in the expression are the major obstacles to the exploitation of the
implicit linearity. Comparison results on Rademacher complexity, including those in Section 2, will be
invoked to get them out of the way. After stochastic Lipschitz continuity is in place, the rest of the
work is similar to the full data case and actually requires fewer technical assumptions. Finally, proofs
of auxiliary results are collected in the Appendix.



1.1 Notation 

For $s \in \mathbb{R}^k$, denote by $s_1, \ldots, s_k$ its coordinates and $\mathrm{spt}(s)$ its support, i.e., $\{j : s_j \ne 0\}$. For
$J \subset \{1, \ldots, k\}$, denote by $\pi_J$ the function that maps $s$ to $(s_1', \ldots, s_k')$ with $s_j' = s_j 1\{j \in J\}$. For
$q \in [1, \infty]$, the $\ell_q$ norm of $s \in \mathbb{R}^k$ is

$$\|s\|_q = \begin{cases} \big(\sum_{j \le k} |s_j|^q\big)^{1/q} & \text{if } q < \infty, \\ \max_{j \le k} |s_j| & \text{if } q = \infty. \end{cases}$$

A function $h$ from $\mathbb{R}^k$ to $\mathbb{R}$ is called $(M, \ell_q)$-Lipschitz, if

$$|h(t) - h(s)| \le M \|t - s\|_q, \quad \text{for all } s, t \in \mathbb{R}^k.$$

If $h$ is defined on $\mathbb{R}$, then, as all the $\ell_q$ norms are the same on $\mathbb{R}$, we simply say $h$ is $M$-Lipschitz.
For any random variable $\xi$, denote

$$[\xi] = \xi - E\xi.$$

If $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_N$ are independent random variables, denote by $E_X$ (resp. $E_Y$) the integral
wrt the (marginal) law of $X_1, \ldots, X_n$ (resp. $Y_1, \ldots, Y_N$).

With a little abuse of notation, by $x = \arg\min f$ we mean $f(x) = \min f$ and that the minimizer
of $f$ may not be unique. The same interpretation applies to argmax, argsup and arginf.



2 Comparison theorems for multivariate functions 



Let $N \ge 1$ and $k \ge 1$ be integers. Denote $V = \mathbb{R}^k$ and denote elements in $V^N$ by $t = (t_1, \ldots, t_N)$,
with $t_i = (t_{i1}, \ldots, t_{ik}) \in V$. In this section, $\varepsilon_i$ and $\varepsilon_{ij}$ will always denote Rademacher variables.
Furthermore, they are always assumed to be independent from each other.

Theorem 2.1. Let $T \subset V^N$ be a bounded set and $h_1, \ldots, h_N$ be functions $V \to \mathbb{R}$ such that each $h_i$
is $(M_i, \ell_\infty)$-Lipschitz and satisfies the vanishing condition that $h_i(t) = 0$ if some $t_j = 0$. Then, for
any function $\Phi : [0, \infty) \to \mathbb{R}$ convex and nondecreasing,

$$E\Phi\Big(\frac{1}{2} \sup_{t \in T} \Big|\sum_{i \le N} \varepsilon_i h_i(t_i)\Big|\Big) \le E\Phi\Big(\Big(\sup_{t \in T} \sum_{i,j} M_i \varepsilon_{ij} t_{ij}\Big)^+\Big) \le E\Phi\Big(\sup_{t \in T} \Big|\sum_{i,j} M_i \varepsilon_{ij} t_{ij}\Big|\Big). \tag{1}$$

Furthermore, for $G : \mathbb{R} \to \mathbb{R}$ convex and nondecreasing,

$$EG\Big(\sup_{t \in T} \sum_{i \le N} \varepsilon_i h_i(t_i)\Big) \le EG\Big(\sup_{t \in T} \sum_{i,j} M_i \varepsilon_{ij} t_{ij}\Big). \tag{2}$$

The vanishing condition in Theorem 2.1 is satisfied, for example, by tiffa), where $f$ is Lipschitz
with $f(0) = 0$. In general, while a function $h$ with $h(0) = 0$ may not satisfy the condition, it always
allows a decomposition into a sum of functions each satisfying the condition. For example, if $h$ is
defined on $\mathbb{R}^2$, then $h(s, t) = f(s, t) + h(0, t) + h(s, 0)$ with $f(s, t) = h(s, t) - h(0, t) - h(s, 0)$, $h(0, t)$,
$h(s, 0)$ each satisfying the vanishing condition. The decomposition leads to the following result.

Theorem 2.2. Let $T \subset V^N$ be a bounded set and $h_1, \ldots, h_N$ be functions $V \to \mathbb{R}$ such that each $h_i$
is $(M_i, \ell_\infty)$-Lipschitz with $h_i(0) = 0$. For $j \le k$, let $T_j = \{(t_{1j}, \ldots, t_{Nj}) : (t_1, \ldots, t_N) \in T\} \subset \mathbb{R}^N$.
Then

$$E \sup_{t \in T} \Big|\sum_{i \le N} \varepsilon_i h_i(t_i)\Big| \le \beta_k \sum_{j \le k} E \sup_{s \in T_j} \Big|\sum_{i \le N} \varepsilon_i M_i s_i\Big|, \tag{3}$$

where $\beta_k$ is a universal constant that can be set no greater than $4 \cdot 3^{k-1} - 2^k$.

Similar to the known univariate results, Theorem 2.2 remains true if $\varepsilon_i$ are replaced with independent
integrable symmetric variables $\gamma_i$. Indeed, by $(\gamma_1, \ldots, \gamma_N) \sim (\varepsilon_1 |\gamma_1|, \ldots, \varepsilon_N |\gamma_N|)$, where $\varepsilon_1, \ldots, \varepsilon_N$
are independent from $\gamma_i$, the result follows by first integrating over $\varepsilon_i$ while conditioning on $|\gamma_i|$, and
then integrating over $|\gamma_i|$. As can be seen, $\gamma_i$ need not be identically distributed in the argument.
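For small $N$ and $k$, inequality (2) of Theorem 2.1 can be checked exactly by enumerating all sign patterns. The following is a minimal sketch, taking $G(x) = x$ and $M_i = 1$; the set `T` and the function `h_min_abs` (the minimum of the absolute coordinates, which is $(1, \ell_\infty)$-Lipschitz and vanishes when some coordinate is 0) are hypothetical choices made only for illustration.

```python
from itertools import product

def h_min_abs(t):
    """(1, l_inf)-Lipschitz, and zero whenever some coordinate of t is zero."""
    return min(abs(x) for x in t)

def lhs_expectation(T, h):
    """E sup_{t in T} sum_i eps_i h(t_i), averaged exactly over all sign vectors."""
    N = len(T[0])
    total = 0.0
    for eps in product((-1, 1), repeat=N):
        total += max(sum(e * h(ti) for e, ti in zip(eps, t)) for t in T)
    return total / 2 ** N

def rhs_expectation(T, M=1.0):
    """E sup_{t in T} sum_{i,j} M eps_ij t_ij, averaged exactly over all sign arrays."""
    N, k = len(T[0]), len(T[0][0])
    total = 0.0
    for eps in product((-1, 1), repeat=N * k):
        total += max(
            sum(M * eps[i * k + j] * t[i][j] for i in range(N) for j in range(k))
            for t in T
        )
    return total / 2 ** (N * k)

# Hypothetical bounded set T: four points, each with N = 3 vectors in R^2.
T = [
    ((1.0, 2.0), (0.5, -1.0), (0.0, 3.0)),
    ((-2.0, 1.0), (1.5, 1.5), (2.0, -0.5)),
    ((0.0, 0.0), (0.0, 0.0), (0.0, 0.0)),
    ((3.0, -1.0), (-1.0, 2.0), (1.0, 1.0)),
]
lhs = lhs_expectation(T, h_min_abs)
rhs = rhs_expectation(T)
print(lhs, rhs)
```

Exhaustive enumeration is only feasible for tiny $N k$, but it avoids Monte Carlo error, so the comparison of the two printed values is exact up to floating point rounding.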



2.1 Proofs 
Lemma 2.3. Let $h : V \to \mathbb{R}$ be $(M, \ell_\infty)$-Lipschitz and satisfy the condition that $h(t) = 0$ if some
$t_j = 0$. Suppose $S \subset \mathbb{R} \times V$ is bounded. Then for any $G : \mathbb{R} \to \mathbb{R}$ convex and nondecreasing,

$$EG\Big(\sup_{(x,s) \in S} (x + \varepsilon h(s))\Big) \le EG\Big(\sup_{(x,s) \in S} \Big(x + M \sum_{j \le k} \varepsilon_j s_j\Big)\Big). \tag{4}$$

Proof. First, we notice that

$$|h(t)| \le M \min(|t_1|, \ldots, |t_k|). \tag{5}$$

Indeed, for any $j \le k$, let $s = \pi_{\{1,\ldots,k\} \setminus \{j\}} t$, i.e., $s$ has the same coordinates as $t$ except the $j$th one
being 0. Then $h(s) = 0$, and as $h$ is $(M, \ell_\infty)$-Lipschitz, $|h(t)| = |h(t) - h(s)| \le M \|t - s\|_\infty = M |t_j|$.

We shall assume $S$ is compact. By dominated convergence, the assumption causes no loss of
generality. Also, we shall assume $M = 1$. Otherwise, we can use change of variables $s' = Ms$,
$h'(s') = h(s'/M)$ to reduce to this case. Let

$$(a, u) = \arg\sup_{(x,s) \in S} (x + h(s)), \qquad (b, v) = \arg\sup_{(x,s) \in S} (x - h(s)).$$

Then

$$EG\Big(\sup_{(x,s) \in S} (x + \varepsilon h(s))\Big) = \frac{1}{2} \big[G(a + h(u)) + G(b - h(v))\big].$$

Assume $|u_i - v_i| = \|u - v\|_\infty$. Then, in order to show (4), it suffices to show

$$G(a + h(u)) + G(b - h(v)) \le \begin{cases} E\Big[G\Big(a + u_i + \sum_{j \ne i} \varepsilon_j u_j\Big) + G\Big(b - v_i + \sum_{j \ne i} \varepsilon_j v_j\Big)\Big] & \text{if } u_i \ge v_i, \\ E\Big[G\Big(a - u_i + \sum_{j \ne i} \varepsilon_j u_j\Big) + G\Big(b + v_i + \sum_{j \ne i} \varepsilon_j v_j\Big)\Big] & \text{else.} \end{cases} \tag{6}$$

Suppose $u_i \ge v_i$. Since $G$ is convex, by Jensen's inequality, (6) is implied by

$$G(a + h(u)) + G(b - h(v)) \le G(a + u_i) + G(b - v_i). \tag{7}$$

Following [11], the proof of (7) is divided into 3 cases.

1) $u_i \ge v_i \ge 0$. Now (7) is equivalent to $G(b - h(v)) - G(b - v_i) \le G(a + u_i) - G(a + h(u))$,
so by the convexity of $G$, we only need to show

$$u_i - h(u) \ge v_i - h(v) \ge 0, \qquad a + h(u) \ge b - v_i. \tag{8}$$

Since $h$ is $(1, \ell_\infty)$-Lipschitz, $h(u) - h(v) \le \|u - v\|_\infty = u_i - v_i$, and so $u_i - h(u) \ge v_i - h(v)$. By
(5), $v_i - h(v) \ge v_i - |h(v)| \ge 0$ and together with the definition of $u$, $a + h(u) \ge b + h(v) \ge b - v_i$.
Thus (8), and hence (7), follows.

2) $u_i \ge 0 \ge v_i$. By (5), $a + u_i \ge a + |h(u)| \ge a + h(u)$, $b - v_i = b + |v_i| \ge b + |h(v)| \ge b - h(v)$.
Since $G$ is nondecreasing, (7) holds.

3) $0 \ge u_i \ge v_i$. It suffices to show $G(a + h(u)) - G(a + u_i) \le G(b - v_i) - G(b - h(v))$. The
proof is completely similar to case 1). We thus have shown (6) for the case $u_i \ge v_i$. The proof for
$u_i < v_i$ is completely similar. □



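Proof of Theorem 2.1. First, (1) is a consequence of (2). To see this, let $H_t = (h_1(t_1), \ldots, h_N(t_N))$
and $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_N)$. Then

$$\sup_{t \in T} |\langle \varepsilon, H_t \rangle| = \sup_{t \in T} \big(\langle \varepsilon, H_t \rangle^+ + \langle \varepsilon, H_t \rangle^-\big) \le \sup_{t \in T} \langle \varepsilon, H_t \rangle^+ + \sup_{t \in T} \langle \varepsilon, H_t \rangle^-.$$

So by the nondecreasing monotonicity and convexity of $\Phi$,

$$E\Phi\Big(\frac{1}{2} \sup_{t \in T} |\langle \varepsilon, H_t \rangle|\Big) \le \frac{1}{2} \Big[E\Phi\Big(\sup_{t \in T} \langle \varepsilon, H_t \rangle^+\Big) + E\Phi\Big(\sup_{t \in T} \langle \varepsilon, H_t \rangle^-\Big)\Big].$$

Since $\varepsilon \sim -\varepsilon$, then $\sup_{t \in T} \langle \varepsilon, H_t \rangle^- \sim \sup_{t \in T} \langle -\varepsilon, H_t \rangle^- = \sup_{t \in T} \langle \varepsilon, H_t \rangle^+$, which together with the
previous inequality yields

$$E\Phi\Big(\frac{1}{2} \sup_{t \in T} |\langle \varepsilon, H_t \rangle|\Big) \le E\Phi\Big(\sup_{t \in T} \langle \varepsilon, H_t \rangle^+\Big) = E\Phi\Big(\Big(\sup_{t \in T} \langle \varepsilon, H_t \rangle\Big)^+\Big),$$

where the equality follows from the fact that $\sup_{a \in A} a^+ = (\sup_{a \in A} a)^+$ for any $A \subset \mathbb{R}$. Now use the
fact that $G(x) = \Phi(x^+)$ is convex and nondecreasing and (2) to get

$$E\Phi\Big(\frac{1}{2} \sup_{t \in T} |\langle \varepsilon, H_t \rangle|\Big) \le E\Phi\Big(\Big(\sup_{t \in T} \sum_{i,j} M_i \varepsilon_{ij} t_{ij}\Big)^+\Big).$$

This proves the first inequality in (1). By the nondecreasing monotonicity of $\Phi$, the second inequality
in (1) follows.

It remains to show (2). If $N = 1$, then $T \subset V$ and by letting $S = \{(0, t) : t \in T\}$, (2) follows from
Lemma 2.3. Suppose $N \ge 2$. Given $z_1, \ldots, z_{N-1} \in \{-1, 1\}$, let

$$S = \Big\{\Big(\sum_{i \le N-1} z_i h_i(t_i),\ t_N\Big) : (t_1, \ldots, t_N) \in T\Big\} \subset \mathbb{R} \times V.$$

Then by Lemma 2.3,

$$EG\Big(\sup_{(x,s) \in S} (x + \varepsilon_N h_N(s))\Big) \le EG\Big(\sup_{(x,s) \in S} \Big(x + M_N \sum_{j \le k} \varepsilon_{Nj} s_j\Big)\Big).$$

Since $\varepsilon_i$ and $\varepsilon_{ij}$ are independent, this can be written as

$$E\Big[G\Big(\sup_{t \in T} \sum_{i \le N} \varepsilon_i h_i(t_i)\Big) \,\Big|\, \varepsilon_i = z_i,\ i \le N-1\Big] \le E\Big[G\Big(\sup_{t \in T} \Big(\sum_{i \le N-1} \varepsilon_i h_i(t_i) + M_N \sum_{j \le k} \varepsilon_{Nj} t_{Nj}\Big)\Big) \,\Big|\, \varepsilon_i = z_i,\ i \le N-1\Big].$$

Integrate over $z_1, \ldots, z_{N-1}$ to get

$$EG\Big(\sup_{t \in T} \sum_{i \le N} \varepsilon_i h_i(t_i)\Big) \le EG\Big(\sup_{t \in T} \Big(\sum_{i \le N-1} \varepsilon_i h_i(t_i) + M_N \sum_{j \le k} \varepsilon_{Nj} t_{Nj}\Big)\Big).$$

Now apply the same argument to the expectation on the right hand side, except that we condition
on $\varepsilon_i$, $i \le N-2$, and $\varepsilon_{Nj}$, $j \le k$. Then the expectation is no greater than

$$EG\Big(\sup_{t \in T} \Big(\sum_{i \le N-2} \varepsilon_i h_i(t_i) + \sum_{i=N-1}^{N} M_i \sum_{j \le k} \varepsilon_{ij} t_{ij}\Big)\Big).$$

The proof is then finished by induction. □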
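Proof of Theorem 2.2. Write $[k] = \{1, \ldots, k\}$ and $\pi_{-j}$ for $\pi_{[k] \setminus \{j\}}$. Then for $i \le N$ and $t \in V$,

$$h_i(t) = \sum_{J \subset [k]} f_{iJ}(t), \tag{9}$$

where $f_{iJ}(t) = \sum_{I \subset J} (-1)^{|J \setminus I|} h_i(\pi_I t)$. Indeed, the right hand side of (9) is

$$\sum_{J \subset [k]} \sum_{I \subset J} (-1)^{|J \setminus I|} h_i(\pi_I t) = \sum_{I \subset [k]} h_i(\pi_I t) \sum_{I \subset J \subset [k]} (-1)^{|J \setminus I|} = \sum_{I \subset [k]} (1 - 1)^{k - |I|} h_i(\pi_I t) = h_i(t).$$

Since $h_i(\pi_I t)$ are $(M_i, \ell_\infty)$-Lipschitz and $h_i(\pi_\emptyset t) = h_i(0) = 0$, for each $J$ with $|J| = s > 0$, $f_{iJ}$ is
$(c_s M_i, \ell_\infty)$-Lipschitz, where $c_s = 2^s - 1$. It is easy to see that $f_{iJ}(t)$ only depends on $t_j$ with $j \in J$.
For each $j \in J$, letting $s = \pi_{-j} t$,

$$f_{iJ}(s) = \sum_{j \notin I \subset J} (-1)^{|J \setminus I|} h_i(\pi_I s) + \sum_{j \notin I \subset J} (-1)^{|J \setminus (\{j\} \cup I)|} h_i(\pi_{\{j\} \cup I} s) = \sum_{j \notin I \subset J} (-1)^{|J \setminus I|} \big[h_i(\pi_I s) - h_i(\pi_{\{j\} \cup I} s)\big].$$

For every $I$ not containing $j$, $\pi_I s = \pi_{\{j\} \cup I} s = \pi_I t$. As a result, $f_{iJ}(\pi_{-j} t) = 0$. In other words, as a
function only in $(t_j, j \in J)$, $f_{iJ}(t)$ vanishes if $t_j = 0$ for some $j \in J$. Then by Theorem 2.1,

$$E \sup_{t \in T} \Big|\sum_{i \le N} \varepsilon_i f_{iJ}(t_i)\Big| \le 2 c_s E \sup_{t \in T} \Big|\sum_{i \le N} M_i \sum_{j \in J} \varepsilon_{ij} t_{ij}\Big|.$$

Since all $\varepsilon_i$, $\varepsilon_{ij}$ are i.i.d.,

$$E \sup_{t \in T} \Big|\sum_{i \le N} M_i \sum_{j \in J} \varepsilon_{ij} t_{ij}\Big| \le \sum_{j \in J} E \sup_{t \in T} \Big|\sum_{i \le N} \varepsilon_{ij} M_i t_{ij}\Big| = \sum_{j \in J} E \sup_{s \in T_j} \Big|\sum_{i \le N} \varepsilon_i M_i s_i\Big|.$$

Combining (9) and the above bound,

$$E \sup_{t \in T} \Big|\sum_{i \le N} \varepsilon_i h_i(t_i)\Big| \le \sum_{J \subset [k]} E \sup_{t \in T} \Big|\sum_{i \le N} \varepsilon_i f_{iJ}(t_i)\Big| \le 2 \sum_{J \subset [k]} (2^{|J|} - 1) \sum_{j \in J} E \sup_{s \in T_j} \Big|\sum_{i \le N} \varepsilon_i M_i s_i\Big|.$$

By simple combinatorial calculation, the proof is complete.

□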
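The decomposition (9) and the vanishing property of each component can be checked numerically. The sketch below does so for a single hypothetical function `h_example` on $\mathbb{R}^3$ (chosen only for illustration; coordinates are indexed from 0 in the code rather than from 1 as in the text).

```python
from itertools import combinations

def subsets(indices):
    """All subsets of an index collection, as frozensets."""
    items = sorted(indices)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def project(t, keep):
    """pi_I t: keep coordinates with index in `keep`, zero out the rest."""
    return tuple(x if j in keep else 0.0 for j, x in enumerate(t))

def f_component(h, J, t):
    """f_J(t) = sum over subsets I of J of (-1)^{|J minus I|} h(pi_I t)."""
    return sum((-1) ** (len(J) - len(I)) * h(project(t, I)) for I in subsets(J))

def h_example(t):
    """Hypothetical test function: Lipschitz in the l_inf norm with h(0) = 0."""
    return abs(t[0] + 2.0 * t[1] - t[2]) + min(abs(t[0]), abs(t[2]))

k = 3
t = (1.5, -0.7, 2.0)
all_J = list(subsets(range(k)))

# Identity (9): the components f_J sum back to h.
reconstructed = sum(f_component(h_example, J, t) for J in all_J)

# Vanishing condition: f_J is zero once some coordinate t_j with j in J is zero.
max_violation = 0.0
for J in all_J:
    for j in J:
        t_zeroed = tuple(0.0 if idx == j else x for idx, x in enumerate(t))
        max_violation = max(max_violation, abs(f_component(h_example, J, t_zeroed)))

print(reconstructed, h_example(t), max_violation)
```

The number of subsets grows as $2^k$, which is consistent with the exponential dependence of $\beta_k$ on $k$ in Theorem 2.2.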



3 Stochastic Lipschitz conditions 

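Let $(Y_1, Z_1), \ldots, (Y_N, Z_N)$ be independent random vectors, with $Y_i$ taking values in a measurable
space $\mathcal{Y}$ and $Z_i$ being $k \times p$ matrices. For $j \le k$, denote by $Z_{ij}^\top$ the $j$th row vector of $Z_i$ and for $h \le p$,
$Z_{ijh}$ the $(j, h)$th entry of $Z_i$. That is

$$Z_i = \begin{pmatrix} Z_{i1}^\top \\ \vdots \\ Z_{ik}^\top \end{pmatrix} = \begin{pmatrix} Z_{i11} & Z_{i12} & \cdots & Z_{i1p} \\ \vdots & \vdots & & \vdots \\ Z_{ik1} & Z_{ik2} & \cdots & Z_{ikp} \end{pmatrix}.$$

Henceforth, we consider the case where $Z_i$ are fixed. Let $D \subset \mathbb{R}^p$ be a fixed domain. Then for
$i \le N$ and $u \in D$, $Z_i u \in \mathbb{R}^k$. Define

$$M_Z = \max_{i \le N} \max_{j \le k} \|Z_{ij}\|_\infty, \qquad R_D = \sup_{u, v \in D} \|u - v\|_1. \tag{10}$$

Suppose $\gamma_1, \ldots, \gamma_N$ are real valued functions on $\mathbb{R}^k \times \mathcal{Y}$. For $j \le k$, denote by $\partial_j$ the first partial
differentiation wrt $t_j$. We make the following assumption.

Assumption 1. For all $i \le N$ and $y \in \mathcal{Y}$, $\gamma_i(t, y)$ is first order differentiable in $t$, such that

$$F_1 := \sup\big\{|\partial_j \gamma_i(s, y)| : s \in \mathbb{R}^k,\ y \in \mathcal{Y},\ j \le k,\ i \le N\big\} < \infty,$$

$$F_2 := \sup\Big\{\frac{|\partial_j \gamma_i(s, y) - \partial_j \gamma_i(t, y)|}{\|s - t\|_\infty} : s, t \in \mathbb{R}^k,\ s \ne t,\ y \in \mathcal{Y},\ j \le k,\ i \le N\Big\} < \infty.$$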
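As an illustration of Assumption 1, consider the logistic-type loss $\gamma(t, y) = \ln(1 + e^t) - yt$ with $k = 1$ and $y \in \{0, 1\}$. This loss is an assumption made for this sketch and is not an example taken from the text. Its derivative in $t$ is the logistic function minus $y$, so $F_1 \le 1$ and $F_2 \le 1/4$. The sketch below checks both bounds on a grid.

```python
import math

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def d_gamma(t, y):
    """Derivative in t of the hypothetical loss ln(1 + e^t) - y * t."""
    return sigmoid(t) - y

grid = [x / 10.0 for x in range(-100, 101)]  # t in [-10, 10], step 0.1

# Grid estimate of F_1 = sup |d_gamma|.
F1_grid = max(abs(d_gamma(t, y)) for t in grid for y in (0, 1))

# Grid estimate of F_2 = sup |d_gamma(s, y) - d_gamma(t, y)| / |s - t|.
F2_grid = max(
    abs(d_gamma(s, y) - d_gamma(t, y)) / abs(s - t)
    for y in (0, 1)
    for s in grid
    for t in grid
    if s != t
)
print(F1_grid, F2_grid)
```

A grid search only gives lower estimates of the two suprema; the bounds $1$ and $1/4$ themselves follow from the range of the logistic function and of its derivative.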



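Theorem 3.1 (Local stochastic Lipschitz continuity). Let $\theta \in D$ be fixed. Under Assumption 1, for
$u \in D$,

$$\sum_{i \le N} [\gamma_i(Z_i u, Y_i)] = \sum_{i \le N} [\gamma_i(Z_i \theta, Y_i)] + \sum_{i,j} [\partial_j \gamma_i(Z_i \theta, Y_i)] Z_{ij}^\top (u - \theta) + \xi(u)^\top (u - \theta), \tag{11}$$

where $\xi(u) \in \mathbb{R}^p$ is a process with the property that for any $q \in (0, 1)$,

$$\Pr\Big\{\sup_{u \in D} \|\xi(u)\|_\infty > A \sqrt{\ln(2p) \sum_{j \le k} \max_{h \le p} \sum_{i \le N} Z_{ijh}^2} + B \sqrt{\ln(p/q) \max_{h \le p} \sum_{i,j} Z_{ijh}^2} + C \ln(p/q)\Big\} \le q, \tag{12}$$

where, letting

$$\phi = M_Z \min(2F_1, F_2 M_Z R_D), \qquad \psi = k \beta_k M_Z F_2, \tag{13}$$

with $\beta_k$ a universal constant as in Theorem 2.2, $A = 4\sqrt{2k} R_D \psi$, $B = \sqrt{2k} \phi$, $C = 8k\phi$.

Furthermore, given $q_0 \in (0, 1)$, for any $q, q' \in (0, 1)$ satisfying $q + q' = q_0$, w.p. at least $1 - q_0$,

$$\sup_{u \in D \setminus \{\theta\}} \frac{1}{\|u - \theta\|_1} \Big|\sum_{i \le N} [\gamma_i(Z_i u, Y_i)] - \sum_{i \le N} [\gamma_i(Z_i \theta, Y_i)]\Big| \le \sqrt{2k} F_1 \sqrt{\ln(2p/q') \max_{h \le p} \sum_{i,j} Z_{ijh}^2} + A \sqrt{\ln(2p) \sum_{j \le k} \max_{h \le p} \sum_{i \le N} Z_{ijh}^2} + B \sqrt{\ln(p/q) \max_{h \le p} \sum_{i,j} Z_{ijh}^2} + C \ln(p/q). \tag{14}$$

Theorem 3.2 (Stochastic Lipschitz continuity). Fix an arbitrary $\theta \in D$. Under Assumption 1, for
$u$ and $v \in D$,

$$\sum_{i \le N} [\gamma_i(Z_i u, Y_i)] - \sum_{i \le N} [\gamma_i(Z_i v, Y_i)] = \sum_{i,j} [\partial_j \gamma_i(Z_i \theta, Y_i)] Z_{ij}^\top (u - v) + \xi(u, v)^\top (u - v), \tag{15}$$

where $\xi(u, v) \in \mathbb{R}^p$ is a process with the property that for any $q \in (0, 1)$,

$$\Pr\Big\{\sup_{u, v \in D} \|\xi(u, v)\|_\infty > A \sqrt{\ln(2p) \sum_{j \le k} \max_{h \le p} \sum_{i \le N} Z_{ijh}^2} + B \sqrt{\ln(p/q) \max_{h \le p} \sum_{i,j} Z_{ijh}^2} + C \ln(p/q)\Big\} \le q, \tag{16}$$

where, letting

$$\phi = 2 M_Z \min(F_1, F_2 M_Z R_D), \qquad \psi = 2k \beta_{2k} M_Z F_2, \tag{17}$$

with $\beta_{2k}$ the universal constant as in Theorem 2.2, $A = 4\sqrt{2k} R_D \psi$, $B = \sqrt{2k} \phi$, $C = 8k\phi$.

Furthermore, given $q_0 \in (0, 1)$, for any $q, q' \in (0, 1)$ with $q + q' = q_0$, w.p. at least $1 - q_0$,

$$\sup_{u \ne v \in D} \frac{1}{\|u - v\|_1} \Big|\sum_{i \le N} [\gamma_i(Z_i u, Y_i)] - \sum_{i \le N} [\gamma_i(Z_i v, Y_i)]\Big| \le \sqrt{2k} F_1 \sqrt{\ln(2p/q') \max_{h \le p} \sum_{i,j} Z_{ijh}^2} + A \sqrt{\ln(2p) \sum_{j \le k} \max_{h \le p} \sum_{i \le N} Z_{ijh}^2} + B \sqrt{\ln(p/q) \max_{h \le p} \sum_{i,j} Z_{ijh}^2} + C \ln(p/q). \tag{18}$$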



3.1 Preliminaries 



We shall repeatedly use several fundamental results in probability. First, the following lemma is a
combination of the measure concentration results in [11], [15] tailored for our needs.

Lemma 3.3. Suppose $f_1(u), \ldots, f_N(u) \in \mathbb{R}$ are independent stochastic processes indexed by $u \in D$,
where $D \subset \mathbb{R}^p$ is a measurable set, such that w.p. 1, each $f_i$ has a continuous path. Suppose there are
$a_i < b_i$, $i \le N$, such that w.p. 1, $a_i \le f_i(u) \le b_i$ for all $i \le N$ and $u \in D$. Let

$$W = \sup_{u \in D} \Big|\sum_{i \le N} f_i(u)\Big|.$$

Then for any $s > 0$,

$$\Pr\Big\{W \ge EW + \sqrt{2s \sum_{i \le N} (b_i - a_i)^2}\Big\} \le e^{-s}. \tag{19}$$

Furthermore, assume $E f_i(u) = 0$ for all $i \le N$ and $u \in D$. Let $M > 0$ such that w.p. 1, $|f_i(u)| \le M$,
for all $i \le N$ and $u \in D$, and let $S > 0$ such that $\sum_{i \le N} \mathrm{Var}(f_i(u)) \le S^2$ for all $u \in D$. Then for
any $s > 0$,

$$\Pr\big\{W \ge 2EW + S\sqrt{2s} + 4Ms\big\} \le e^{-s}. \tag{20}$$

Next, we need the following comparison inequality involving univariate functions (cf. [14], Theorem
4.12; [20]).

Lemma 3.4. Let $D \subset \mathbb{R}^p$ be a measurable set and $\gamma_1, \ldots, \gamma_N$ be continuous functions from $D$ to $\mathbb{R}$.
Suppose $f_1, \ldots, f_N$ are continuous functions $\mathbb{R} \to \mathbb{R}$ that map 0 to 0 and are all $M$-Lipschitz for some
$M > 0$. If $\varepsilon_1, \ldots, \varepsilon_N$ are i.i.d. Rademacher variables, then

$$E \sup_{u \in D} \Big|\sum_{i \le N} \varepsilon_i f_i(\gamma_i(u))\Big| \le 2M E \sup_{u \in D} \Big|\sum_{i \le N} \varepsilon_i \gamma_i(u)\Big|.$$

The continuity assumption in the above two lemmas is used to ensure measurability and is satisfied
in the situations we shall consider. Inequality (19) is referred to as the functional Hoeffding inequality in
[15]. The proof of Lemma 3.3 is given in the Appendix. Finally, we shall also repeatedly use the following
inequality (cf. [16], Lemma 5.2).

Lemma 3.5. Let $\varepsilon_1, \ldots, \varepsilon_N$ be i.i.d. Rademacher variables and $A \subset \mathbb{R}^N$ a finite set. Let $A_1 =
\{a, -a : a \in A\}$. Then

$$E \max_{a \in A} \Big|\sum_{i \le N} \varepsilon_i a_i\Big| \le \max_{a \in A} \|a\|_2 \times \sqrt{2 \ln |A_1|} \le \max_{a \in A} \|a\|_2 \times \sqrt{2 \ln(2|A|)}.$$
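For a small finite set $A$, the bound of Lemma 3.5 can be compared with the exact expectation obtained by enumerating all sign vectors. The following is a minimal sketch; the set `A` is a hypothetical example.

```python
import math
from itertools import product

def expected_max_abs(A):
    """E max_{a in A} |sum_i eps_i a_i|, averaged exactly over all sign vectors."""
    N = len(A[0])
    total = 0.0
    for eps in product((-1, 1), repeat=N):
        total += max(abs(sum(e * x for e, x in zip(eps, a))) for a in A)
    return total / 2 ** N

def massart_bound(A):
    """max_a ||a||_2 * sqrt(2 ln(2|A|)), the right-most bound in Lemma 3.5."""
    radius = max(math.sqrt(sum(x * x for x in a)) for a in A)
    return radius * math.sqrt(2.0 * math.log(2 * len(A)))

# Hypothetical finite set A of three vectors in R^5.
A = [
    (1.0, 0.0, -2.0, 0.5, 1.0),
    (0.5, 1.5, 0.0, -1.0, 2.0),
    (-1.0, 1.0, 1.0, 1.0, -1.0),
]
lhs = expected_max_abs(A)
rhs = massart_bound(A)
print(lhs, rhs)
```

In the proofs of this section the lemma is applied with $A$ equal to the $p$ columns $(Z_{1jh}, \ldots, Z_{Njh})$, $h \le p$, which is where the factor $\sqrt{2\ln(2p)}$ comes from.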



3.2 Proof of local stochastic Lipschitz continuity 

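We next prove Theorem 3.1. Denote $c_i = Z_i \theta$ for $i \le N$ and $c = (c_1, \ldots, c_N)$. For $u \in D$, denote

$$t_i = Z_i(u - \theta) = Z_i u - c_i \in \mathbb{R}^k$$

and $t = (t_1, \ldots, t_N)$. Note that $t_i$ are functions only in $u$. For each $i \le N$, denote

$$f_i(\cdot) = \gamma_i(\cdot, Y_i).$$

For $j \le k$, denote by $\pi_j$ the map $(x_1, \ldots, x_k) \to (x_1, \ldots, x_j, 0, \ldots, 0)$. It is easy to check that, for
every $i \le N$,

$$\gamma_i(Z_i u, Y_i) - \gamma_i(Z_i \theta, Y_i) = f_i(c_i + t_i) - f_i(c_i) = \sum_{j \le k} \big(\partial_j f_i(c_i) + \varphi_{ij}(t_i)\big) t_{ij}, \tag{21}$$

where, for $s \in \mathbb{R}^k$,

$$\varphi_{ij}(s) = \begin{cases} \dfrac{f_i(c_i + \pi_j s) - f_i(c_i + \pi_{j-1} s)}{s_j} - \partial_j f_i(c_i) & \text{if } s_j \ne 0, \\ \partial_j f_i(c_i + \pi_{j-1} s) - \partial_j f_i(c_i) & \text{if } s_j = 0. \end{cases}$$

Thus $\varphi_{ij}$ is a function $\mathbb{R}^k \to \mathbb{R}$. We need some basic properties of $\varphi_{ij}$. Recall that $F_1$ and $F_2$ are
defined in Assumption 1.

Lemma 3.6. W.p. 1, for all $i \le N$ and $j \le k$, $\varphi_{ij}(0) = 0$, $|\varphi_{ij}| \le 2F_1$ uniformly, and $\varphi_{ij}$ is
$(F_2, \ell_\infty)$-Lipschitz on $\mathbb{R}^k$. Furthermore, for $u \in D$, $|\varphi_{ij}(Z_i(u - \theta))| \le F_2 M_Z R_D$.

From the decomposition (21) and $t_{ij} = Z_{ij}^\top (u - \theta) \in \mathbb{R}$,

$$\sum_{i \le N} \big(\gamma_i(Z_i u, Y_i) - \gamma_i(Z_i \theta, Y_i)\big) = \sum_{i \le N} \sum_{j \le k} \big(\partial_j f_i(c_i) + \varphi_{ij}(t_i)\big) Z_{ij}^\top (u - \theta),$$

giving

$$\sum_{i \le N} [\gamma_i(Z_i u, Y_i)] - \sum_{i \le N} [\gamma_i(Z_i \theta, Y_i)] = \sum_{i,j} [\partial_j f_i(c_i)] Z_{ij}^\top (u - \theta) + \sum_{i,j} [\varphi_{ij}(t_i)] Z_{ij}^\top (u - \theta).$$

Recall $t_i = Z_i(u - \theta)$. Define for $i \le N$ and $h \le p$

$$\xi_{ih}(u) = \sum_{j \le k} [\varphi_{ij}(t_i)] Z_{ijh}, \qquad \xi_h(u) = \sum_{i \le N} \xi_{ih}(u), \qquad W_h = \sup_{u \in D} |\xi_h(u)|.$$

Then, letting $\xi(u) = (\xi_1(u), \ldots, \xi_p(u))$, it is seen (11) holds and

$$\|\xi(u)\|_\infty = \max_{h \le p} |\xi_h(u)| \le \max_{h \le p} W_h. \tag{22}$$

Given $h$, consider the upper tail of $W_h$. For $i \le N$ and $j \le k$, by Lemma 3.6, $|\varphi_{ij}(t_i) Z_{ijh}| \le \phi$,
where $\phi = M_Z \min(2F_1, F_2 M_Z R_D)$ as in (13). Then

$$|\xi_{ih}(u)| \le \sum_{j \le k} \big|[\varphi_{ij}(t_i)] Z_{ijh}\big| \le 2k\phi := M_0. \tag{23}$$

Given $u \in D$, for each $i \le N$, $\xi_{ih}(u)$ is a function only in $Y_i$. Therefore, by independence and
$\mathrm{Var}(\sum_{j \le k} \nu_j) \le k \sum_{j \le k} \mathrm{Var}(\nu_j) \le k \sum_{j \le k} E\nu_j^2$ for any random variables $\nu_1, \ldots, \nu_k \in \mathbb{R}$,

$$\mathrm{Var}(\xi_h(u)) = \sum_{i \le N} \mathrm{Var}(\xi_{ih}(u)) \le k \sum_{i \le N} \sum_{j \le k} E\big(\varphi_{ij}(t_i) Z_{ijh}\big)^2 \le k\phi^2 \sum_{i,j} Z_{ijh}^2 \le S_0^2, \tag{24}$$

where

$$S_0^2 = k\phi^2 \max_{h \le p} \sum_{i,j} Z_{ijh}^2. \tag{25}$$

From (23), (25) and Lemma 3.3, it follows that

$$\Pr\big\{W_h \ge 2EW_h + S_0\sqrt{2s} + 4M_0 s\big\} \le e^{-s}. \tag{26}$$

Since $E\xi_{ih}(u) = 0$, by symmetrization (cf. the comment after Lemma 6.3 in [14]) and a simple
dominated convergence argument

$$EW_h = E \sup_{u \in D} \Big|\sum_{i \le N} \Big[\sum_{j \le k} \varphi_{ij}(t_i) Z_{ijh}\Big]\Big| \le 2E \sup_{u \in D} \Big|\sum_{i \le N} \varepsilon_i \sum_{j \le k} \varphi_{ij}(t_i) Z_{ijh}\Big| = 2E \sup_{t \in T} \Big|\sum_{i \le N} \varepsilon_i \sum_{j \le k} \varphi_{ij}(t_i) Z_{ijh}\Big|,$$

where $T = \{(t_1, \ldots, t_N) : t_i = Z_i(u - \theta),\ i \le N,\ u \in D\}$ and $\varepsilon_1, \ldots, \varepsilon_N$ are i.i.d. Rademacher variables
independent of $Y_1, \ldots, Y_N$. Given $Y_1, \ldots, Y_N$, by Lemma 3.6, each

$$\varphi_i(s) = \sum_{j \le k} \varphi_{ij}(s) Z_{ijh}$$

is $(k M_Z F_2, \ell_\infty)$-Lipschitz mapping 0 to 0. Note for $j \le k$, $\{(t_{1j}, \ldots, t_{Nj}) : (t_1, \ldots, t_N) \in T\} =
\{(Z_{1j}^\top (u - \theta), \ldots, Z_{Nj}^\top (u - \theta)) : u \in D\}$. Then by Theorem 2.2 and the independence between
$Y_1, \ldots, Y_N$ and $\varepsilon_1, \ldots, \varepsilon_N$, letting $\psi = k \beta_k M_Z F_2$ as in (13),

$$E \sup_{t \in T} \Big|\sum_{i \le N} \varepsilon_i \sum_{j \le k} \varphi_{ij}(t_i) Z_{ijh}\Big| = E_Y E_\varepsilon \sup_{t \in T} \Big|\sum_{i \le N} \varepsilon_i \varphi_i(t_i)\Big| \le \psi \sum_{j \le k} E_Y E_\varepsilon \sup_{u \in D} \Big|\sum_{i \le N} \varepsilon_i Z_{ij}^\top (u - \theta)\Big| \le R_D \psi \sum_{j \le k} E_Y E_\varepsilon \max_{h \le p} \Big|\sum_{i \le N} \varepsilon_i Z_{ijh}\Big|,$$

where the last inequality is due to

$$\Big|\sum_{i \le N} \varepsilon_i Z_{ij}^\top (u - \theta)\Big| \le \|u - \theta\|_1 \max_{h \le p} \Big|\sum_{i \le N} \varepsilon_i Z_{ijh}\Big| \le R_D \max_{h \le p} \Big|\sum_{i \le N} \varepsilon_i Z_{ijh}\Big|$$

for $j \le k$ and $u \in D$. By Lemma 3.5, for each $j \le k$,

$$E_\varepsilon \max_{h \le p} \Big|\sum_{i \le N} \varepsilon_i Z_{ijh}\Big| \le \sqrt{2\ln(2p)} \max_{h \le p} \sqrt{\sum_{i \le N} Z_{ijh}^2}.$$

The right hand side is independent of the values of $Y_1, \ldots, Y_N$. We thus get

$$EW_h \le 2R_D \psi \sum_{j \le k} E_Y E_\varepsilon \max_{h \le p} \Big|\sum_{i \le N} \varepsilon_i Z_{ijh}\Big| \le 2R_D \psi \sqrt{2\ln(2p)} \sum_{j \le k} \sqrt{\max_{h \le p} \sum_{i \le N} Z_{ijh}^2} \le 2R_D \psi \sqrt{2k \ln(2p)} \sqrt{\sum_{j \le k} \max_{h \le p} \sum_{i \le N} Z_{ijh}^2}, \tag{27}$$

where the last inequality is due to the Cauchy-Schwarz inequality. Then by (26),

$$\Pr\Big\{W_h \ge 4R_D \psi \sqrt{2k \ln(2p)} \sqrt{\sum_{j \le k} \max_{h \le p} \sum_{i \le N} Z_{ijh}^2} + S_0\sqrt{2s} + 4M_0 s\Big\} \le e^{-s},$$

with $M_0$ and $S_0$ being defined in (23) and (25). Let $s = \ln(p/q)$ in the above inequality and sum over
$h \le p$. By (22) and the union-sum inequality, (12) is proved.

To prove (14), by (11) and the above discussion,

$$\sup_{u \in D \setminus \{\theta\}} \frac{1}{\|u - \theta\|_1} \Big|\sum_{i \le N} [\gamma_i(Z_i u, Y_i)] - \sum_{i \le N} [\gamma_i(Z_i \theta, Y_i)]\Big| \le \max_{h \le p} \Big|\sum_{i \le N} \sum_{j \le k} [\partial_j \gamma_i(Z_i \theta, Y_i)] Z_{ijh}\Big| + \max_{h \le p} W_h.$$

Because of (12), it is enough to show that

$$\Pr\Big\{\max_{h \le p} \Big|\sum_{i \le N} \sum_{j \le k} [\partial_j \gamma_i(Z_i \theta, Y_i)] Z_{ijh}\Big| > \sqrt{2k} F_1 \sqrt{\ln(2p/q') \max_{h \le p} \sum_{i,j} Z_{ijh}^2}\Big\} \le q'. \tag{28}$$

Given $h \le p$, $\sum_{j \le k} [\partial_j \gamma_i(Z_i \theta, Y_i)] Z_{ijh}$, $i \le N$, are independent of each other, each having mean 0
and falling between

$$\pm F_1 \sum_{j \le k} |Z_{ijh}| - \sum_{j \le k} E\,\partial_j \gamma_i(Z_i \theta, Y_i)\, Z_{ijh}.$$

Therefore, by the Hoeffding inequality ([18], p. 191), for any $s > 0$,

$$\Pr\Big\{\Big|\sum_{i \le N} \sum_{j \le k} [\partial_j \gamma_i(Z_i \theta, Y_i)] Z_{ijh}\Big| > s\Big\} \le 2\exp\Big\{-\frac{s^2}{2F_1^2 \sum_{i \le N} \big(\sum_{j \le k} |Z_{ijh}|\big)^2}\Big\} \le 2\exp\Big\{-\frac{s^2}{2kF_1^2 \sum_{i \le N} \sum_{j \le k} Z_{ijh}^2}\Big\} \le 2\exp\Big\{-\frac{s^2}{2kF_1^2 \max_{h \le p} \sum_{i,j} Z_{ijh}^2}\Big\}.$$

Let $s = \sqrt{2k} F_1 \sqrt{\ln(2p/q') \max_{h \le p} \sum_{i,j} Z_{ijh}^2}$. Then by the union-sum inequality, (28) follows.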
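The choice of $s$ in the last step makes the union bound over $h \le p$ equal to $q'$ exactly. A minimal numerical sketch of that arithmetic follows; the array `Z`, indexed as `Z[i][j][h]`, and the values of `F1` and `q_prime` are hypothetical toy inputs.

```python
import math

def column_energy(Z):
    """max over h of sum_{i,j} Z_ijh^2, for Z indexed as Z[i][j][h]."""
    k, p = len(Z[0]), len(Z[0][0])
    return max(
        sum(Z_i[j][h] ** 2 for Z_i in Z for j in range(k)) for h in range(p)
    )

def hoeffding_threshold(Z, F1, q_prime):
    """s = sqrt(2k) F1 sqrt(ln(2p/q') max_h sum_{i,j} Z_ijh^2), as in (28)."""
    k, p = len(Z[0]), len(Z[0][0])
    return math.sqrt(2 * k) * F1 * math.sqrt(
        math.log(2 * p / q_prime) * column_energy(Z)
    )

def union_tail(s, Z, F1):
    """p * 2 exp(-s^2 / (2 k F1^2 max_h sum Z^2)): the union bound over h <= p."""
    k, p = len(Z[0]), len(Z[0][0])
    return p * 2.0 * math.exp(-s * s / (2 * k * F1 ** 2 * column_energy(Z)))

# Hypothetical toy design: N = 3 observations, k = 2 rows, p = 4 columns.
Z = [
    [[1.0, 0.0, -1.0, 2.0], [0.5, 1.0, 0.0, -1.0]],
    [[0.0, 2.0, 1.0, 1.0], [1.0, -1.0, 0.5, 0.0]],
    [[-1.0, 1.0, 0.0, 0.5], [2.0, 0.0, 1.0, 1.0]],
]
energy = column_energy(Z)
s = hoeffding_threshold(Z, F1=1.0, q_prime=0.05)
tail = union_tail(s, Z, F1=1.0)
print(energy, s, tail)
```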



3.3 Proof of stochastic Lipschitz continuity 



We next prove Theorem 3.2. Since the proof follows that for the local continuity, we shall only
highlight differences in the proof. Denote $c = (c_1, \ldots, c_N)$, with $c_i = Z_i \theta$. For any $u$ and $v \in D$,
denote $s = (s_1, \ldots, s_N)$, $t = (t_1, \ldots, t_N)$, with $s_i = Z_i(u - \theta)$, $t_i = Z_i(v - u)$. Then $c_i, s_i, t_i \in \mathbb{R}^k$.
It is important to note that unlike $\theta$, both $u$ and $v$ are variables. Again, denote $f_i(\cdot) = \gamma_i(\cdot, Y_i)$ and
$\pi_j$ the map $(x_1, \ldots, x_k) \to (x_1, \ldots, x_j, 0, \ldots, 0)$. Then it is easy to check

$$\gamma_i(Z_i v, Y_i) - \gamma_i(Z_i u, Y_i) = f_i(c_i + s_i + t_i) - f_i(c_i + s_i) = \sum_{j \le k} \big(\partial_j f_i(c_i) + \varphi_{ij}(s_i, t_i)\big) t_{ij}, \tag{29}$$

where for $s, t \in \mathbb{R}^k$,

$$\varphi_{ij}(s, t) = \begin{cases} \dfrac{f_i(c_i + s + \pi_j t) - f_i(c_i + s + \pi_{j-1} t)}{t_j} - \partial_j f_i(c_i) & \text{if } t_j \ne 0, \\ \partial_j f_i(c_i + s + \pi_{j-1} t) - \partial_j f_i(c_i) & \text{if } t_j = 0. \end{cases}$$

The function $\varphi_{ij}(s, t)$ has $2k$ real valued variates, $s_1, \ldots, s_k, t_1, \ldots, t_k$.

Lemma 3.7. For all $i \le N$ and $j \le k$, $\varphi_{ij}(0, 0) = 0$, $|\varphi_{ij}(s, t)| \le 2F_1$ uniformly, and $\varphi_{ij}$ is
$(2F_2, \ell_\infty)$-Lipschitz. Furthermore, for $u$ and $v \in D$, $|\varphi_{ij}(Z_i(u - \theta), Z_i(v - u))| \le 2F_2 M_Z R_D$.

From decomposition (29), it follows that

$$\sum_{i \le N} [\gamma_i(Z_i v, Y_i) - \gamma_i(Z_i u, Y_i)] = \sum_{i,j} [\partial_j f_i(c_i)] Z_{ij}^\top (v - u) + \sum_{i,j} [\varphi_{ij}(s_i, t_i)] Z_{ij}^\top (v - u). \tag{30}$$

Define for $i \le N$ and $h \le p$

$$\xi_{ih}(u, v) = \sum_{j \le k} [\varphi_{ij}(s_i, t_i)] Z_{ijh}, \qquad \xi_h(u, v) = \sum_{i \le N} \xi_{ih}(u, v), \qquad W_h = \sup_{u, v \in D} |\xi_h(u, v)|.$$

Then, letting $\xi(u, v) = (\xi_1(u, v), \ldots, \xi_p(u, v))$, it is seen (15) holds and

$$\|\xi(u, v)\|_\infty = \max_{h \le p} |\xi_h(u, v)| \le \max_{h \le p} W_h.$$

Fix $h$. For $i \le N$ and $j \le k$, by Lemma 3.7, letting $\phi = 2M_Z \min(F_1, F_2 M_Z R_D)$ as in (17),

$$|\varphi_{ij}(s_i, t_i) Z_{ijh}| \le \phi. \tag{31}$$

Define $M_0$ and $S_0$ in a similar way as (23) and (25), except that they are in terms of the $\phi$ in (17) instead of the one in (13).
Then, as in (26),

$$\Pr\big\{W_h \ge 2EW_h + S_0\sqrt{2s} + 4M_0 s\big\} \le e^{-s}. \tag{32}$$

Notice that given $Y_1, \ldots, Y_N$, each

$$\varphi_i(s, t) = \sum_{j \le k} \varphi_{ij}(s, t) Z_{ijh}$$

is $(2k M_Z F_2, \ell_\infty)$-Lipschitz on $\mathbb{R}^k \times \mathbb{R}^k$ mapping $(0, 0)$ to 0. Then, following the derivation of (27),

$$EW_h \le 2R_D \psi \sqrt{2k \ln(2p)} \sqrt{\sum_{j \le k} \max_{h \le p} \sum_{i \le N} Z_{ijh}^2}, \tag{33}$$

where $\psi = 2k \beta_{2k} M_Z F_2$ as in (17). The proof of (16) can then be finished in a similar way as (12).
The proof of (18) is completely similar to (14).



4 Lasso for multiple linear combinations of covariates 

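Suppose $\gamma_1, \ldots, \gamma_N$ are measurable functions from $\mathbb{R}^k \times \mathcal{Y}$ to $\mathbb{R}$. Let $X_1, \ldots, X_N \in V := \mathbb{R}^m$ be fixed
covariate vectors and denote by $u_1, \ldots, u_k \in V$ parameters. In this section, we specialize to the
following multivariate loss functions

$$\gamma_i(X_i^\top u_1, \ldots, X_i^\top u_k, Y_i) = \gamma_i(Z_i u, Y_i), \quad i \le N, \tag{34}$$

where

$$Z_i = \begin{pmatrix} X_i^\top & & \\ & \ddots & \\ & & X_i^\top \end{pmatrix} \in \mathbb{R}^{k \times p}, \qquad u = \begin{pmatrix} u_1 \\ \vdots \\ u_k \end{pmatrix} \in \mathbb{R}^p, \quad \text{with } p = km. \tag{35}$$

We assume that the form of $\gamma_i$ is already known and consider the estimation of $u_1, \ldots, u_k$.
Corresponding to the loss functions $\gamma_i$, the total expected loss is

$$L(u) = \sum_{i \le N} E\gamma_i(Z_i u, Y_i). \tag{36}$$

Let $D \subset \mathbb{R}^p$ be a compact domain. Suppose

$$\theta = (\theta_1, \ldots, \theta_k) = \arg\min_{u \in D} L(u). \tag{37}$$

The Lasso estimator for $\theta$ is of the form

$$\hat\theta = (\hat\theta_1, \ldots, \hat\theta_k) = \arg\min_{u \in D} \Big\{\sum_{i \le N} \gamma_i(X_i^\top u_1, \ldots, X_i^\top u_k, Y_i) + \lambda \sum_{j \le k} \|u_j\|_1\Big\}$$
$$= \arg\min_{u \in D} \Big\{\sum_{i \le N} \gamma_i(Z_i u, Y_i) + \lambda \|u\|_1\Big\}, \tag{38}$$

where $\lambda > 0$ is a tuning parameter and in the expression on the second line, $u$ is treated as a
concatenation of $u_1, \ldots, u_k$. We shall assume that the minima in (37) and (38) are always obtained.
However, neither has to have a unique minimizer.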
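The construction of $Z_i$ in (35) and the criterion minimized in (38) can be written out directly. The sketch below builds the block-diagonal $Z_i$ from a covariate vector and evaluates the penalized objective at a given $u$; the loss `gamma_example`, the data `X`, `Y` and the parameter values are hypothetical and serve only to make the example concrete. No minimization is attempted.

```python
def block_design(x, k):
    """Z_i of (35): k x (k*m) block-diagonal matrix with x^T in every diagonal block."""
    m = len(x)
    return [
        [x[h - j * m] if j * m <= h < (j + 1) * m else 0.0 for h in range(k * m)]
        for j in range(k)
    ]

def matvec(Z, u):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(z * v for z, v in zip(row, u)) for row in Z]

def lasso_objective(gamma, X, Y, u, lam, k):
    """sum_i gamma(Z_i u, Y_i) + lam * ||u||_1, the criterion minimized in (38)."""
    loss = sum(gamma(matvec(block_design(x, k), u), y) for x, y in zip(X, Y))
    return loss + lam * sum(abs(v) for v in u)

def gamma_example(t, y):
    """Hypothetical loss with k = 2 linear combinations (an assumption of this sketch)."""
    return 0.5 * (y - t[0] * t[0] - t[1]) ** 2

X = [(1.0, 2.0, 0.0), (0.0, 1.0, -1.0)]   # N = 2 covariates in R^3
Y = [1.0, -0.5]
u1, u2 = (1.0, 0.0, 2.0), (0.0, -1.0, 1.0)
u = list(u1) + list(u2)                   # concatenation of u_1, ..., u_k

# Z_i u reproduces the k linear combinations (X_i^T u_1, ..., X_i^T u_k) of (34).
Zu = matvec(block_design(X[0], 2), u)
direct = [sum(a * b for a, b in zip(X[0], u1)), sum(a * b for a, b in zip(X[0], u2))]
value = lasso_objective(gamma_example, X, Y, u, lam=0.1, k=2)
print(Zu, direct, value)
```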



Denote by $X$ the $N \times m$ design matrix with row vectors $X_1^\top, \ldots, X_N^\top$. For $l \ge 1$, let

$$\sigma_{X,l} := \max\Big\{\frac{\|Xv\|_2}{\|v\|_2} : v \in V,\ 1 \le |\mathrm{spt}(v)| \le l\Big\}. \tag{39}$$

To utilize a restricted eigenvalue (RE) condition introduced in [3], define, for $s \le m$ and $K > 0$,

$$\kappa_X(s, K) := \min\Big\{\frac{\|Xv\|_2}{\sqrt{N}\,\|\pi_J v\|_2} : v \in \mathbb{R}^m \setminus \{0\},\ \|\pi_{J^c} v\|_1 \le K \|\pi_J v\|_1,\ 1 \le |J| \le s\Big\}.$$

Theorem 4.1. Assume $S = \max_{j \le k} |\mathrm{spt}(\theta_j)| \le m/2$. Let $q \in (0, 1)$. Suppose the following conditions
are satisfied.

1) (Restricted eigenvalue) For some $K > 1$, $\kappa := \kappa_X(2S, K) > 0$.

2) (Quadratic lower bound of expected loss) For some $C_\gamma > 0$ and all $i \le N$ and $u \in D$,
$E\gamma_i(Z_i u, Y_i) - E\gamma_i(Z_i \theta, Y_i) \ge C_\gamma \|Z_i(u - \theta)\|_2^2$.

3) (Local Lipschitz) There is $M_q > 0$, such that w.p. at least $1 - q$,

$$\Big|\sum_{i \le N} [\gamma_i(Z_i u, Y_i) - \gamma_i(Z_i \theta, Y_i)]\Big| \le M_q \|u - \theta\|_1 = M_q \sum_{j \le k} \|u_j - \theta_j\|_1, \quad \forall u \in D.$$

Let

$$\lambda = \frac{(K + 1) M_q}{K - 1}, \qquad L_N = \frac{2 M_q K}{N \kappa^2 C_\gamma (K - 1)}.$$

Then, using this $\lambda$ in the Lasso estimator (38), w.p. at least $1 - q$,



\9-9\\ 2 2 < kL N S 



, 2(l + X 2 )(^ 2 + a| s fc) 



l{fe > 1} 



(40) 



(41) 






Compared to the case $k = 1$, (41) has a multiple of $1 + \sigma_{X,S}^2 k/(N\kappa^2)$. The constant $\sigma_{X,S}$ is related
to the so called $S$-restricted isometry constant [9]. The ratio of $\sigma_{X,S}$ to $\kappa$ bears some similarity to
the condition number of a matrix, despite the constraints imposed on their definitions.

Example 4.1. Let $f(y \mid t)$ be probability densities on $\mathbb{R}$ parameterized by $t \in \mathbb{R}^k$. Suppose that given
covariate $x \in V = \mathbb{R}^m$, a response variable $Y$ has density $f(y \mid x^\top \theta_1, \ldots, x^\top \theta_k)$, with $\theta_1, \ldots, \theta_k \in V$
being unknown parameter values. To estimate $\theta$, suppose $Y_i$ under fixed covariate values $X_i$, $i \le N$,
are observed. Denote $Z_i$ as in (35). If it is known that $\theta = (\theta_1, \ldots, \theta_k)$ is in a bounded set $D \subset V^k$,
then by (38), one type of $\ell_1$-regularized likelihood estimator of $\theta$ is

$$\hat\theta = \arg\min_{u \in D} \Big\{-\sum_{i \le N} \ln f(Y_i \mid Z_i u) + \lambda \|u\|_1\Big\},$$

where the tuning parameter $\lambda$ will be selected in a moment. It is seen that the loss functions $\gamma_1, \ldots, \gamma_N$
in the setup are $\gamma_i(t, y) = -\ln f(y \mid t)$ and for any $u \in D$, the total expected loss is

$$L(u) = \sum_{i \le N} E\gamma_i(Z_i \theta, Y_i) + \sum_{i \le N} D(Z_i \theta, Z_i u),$$

where for any $s, t \in \mathbb{R}^k$,

$$D(s, t) = \int f(y \mid s) \ln \frac{f(y \mid s)}{f(y \mid t)}\, dy$$

is the Kullback-Leibler distance from $f(y \mid s)$ to $f(y \mid t)$. It is well known that $D(s, t) \ge 0$ with
equality if and only if $f(y \mid t) = f(y \mid s)$. Therefore, $\theta$ minimizes $L(u)$. However, for high dimensional
$V$ and relatively small $N$, $\theta$ may not be the unique minimizer.

Suppose that all $\theta_j$ satisfy $|\mathrm{spt}(\theta_j)| \le \dim(V)/2 = m/2$. To bound $\|\hat\theta - \theta\|_2$, assume $X =
(X_1, \ldots, X_N)^\top$ satisfies the RE Condition 1) in Theorem 4.1. Since $D$ is bounded, the set of $Z_i u$,
$i \le N$, $u \in D$ is in a compact domain $A$. Suppose that for some $C_\gamma > 0$,

$$D(s, t) \ge C_\gamma \|s - t\|_2^2, \quad s, t \in A. \tag{42}$$

The above condition is satisfied under mild conditions on the regularity of $f(y \mid t)$, using the fact that
for fixed $t$, the Hessian of $D(s, t)$ at $s = t$ is the Fisher information at $t$, which is nonnegative definite.
Then for any $i \le N$ and $u \in D$,

$$E\gamma_i(Z_i u, Y_i) - E\gamma_i(Z_i \theta, Y_i) = D(Z_i \theta, Z_i u) \ge C_\gamma \|Z_i(\theta - u)\|_2^2,$$

so Condition 2) in Theorem 4.1 is satisfied. Finally, by Theorem 3.1, if $-\ln f(y \mid t)$ are first order
differentiable in $t$, such that the partial derivatives are uniformly bounded and have uniformly bounded
Lipschitz coefficient, then for any $q \in (0, 1)$, there is $M_q$ such that Condition 3) in Theorem 4.1 is
satisfied. As a result, by setting $\lambda$ as in (40), we get a bound for $\|\hat\theta - \theta\|_2$ using (41). □
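A simple instance of condition (42) is the unit-variance Gaussian location family with $k = 1$, $f(y \mid t) = (2\pi)^{-1/2} e^{-(y - t)^2/2}$, for which $D(s, t) = (s - t)^2/2$, so (42) holds with $C_\gamma = 1/2$. This family is an assumption made here for illustration and is not an example given in the text. The sketch below compares a numerical integral with the closed form.

```python
import math

def normal_pdf(y, mean):
    """Density of the unit-variance normal distribution with the given mean."""
    return math.exp(-0.5 * (y - mean) ** 2) / math.sqrt(2.0 * math.pi)

def kl_numeric(s, t, lo=-30.0, hi=30.0, n=60000):
    """Midpoint-rule approximation of D(s, t) = int f(y|s) ln(f(y|s)/f(y|t)) dy."""
    step = (hi - lo) / n
    total = 0.0
    for i in range(n):
        y = lo + (i + 0.5) * step
        # ln f(y|s) - ln f(y|t) in closed form, which avoids log(0) in the tails
        log_ratio = -0.5 * (y - s) ** 2 + 0.5 * (y - t) ** 2
        total += normal_pdf(y, s) * log_ratio * step
    return total

pairs = [(0.0, 1.0), (-1.5, 0.5), (2.0, 2.0), (0.3, -0.7)]
results = [(s, t, kl_numeric(s, t), 0.5 * (s - t) ** 2) for s, t in pairs]
for s, t, numeric, closed_form in results:
    print(s, t, numeric, closed_form)
```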

4.1 Proof of Theorem 4.1

The proof is divided into 3 steps. 

Step 1. The argument in this step has now become standard [^. Let c = (K — l)/2, where K is as 
in Condition 1). Then $\lambda = (1 + 1/c)M_q$. From the definition of $\hat\theta$,

L(9) - L(9) < £>i(£i0,li)] - £ MZiB, Y)j + (1 + l/ c )M,(||0||i - ||0||i). (43) 

i<N i<N 
By Condition 2) in Theorem 4.1,

\[ L(\hat\theta) - L(\theta) \ge C_\gamma\sum_{i \le N}\|Z_i(\hat\theta - \theta)\|_2^2 = C_\gamma\sum_{i \le N}\sum_{j \le k}|X_i^T(\hat\theta_j - \theta_j)|^2 = C_\gamma\sum_{j \le k}\|X(\hat\theta_j - \theta_j)\|_2^2. \]

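Then by Condition 3) and (43), w.p. at least $1 - q$,

\[ C_\gamma\sum_{j \le k}\|X(\hat\theta_j - \theta_j)\|_2^2 \le M_q\sum_{j \le k}\|\hat\theta_j - \theta_j\|_1 + (1 + 1/c)M_q\sum_{j \le k}(\|\theta_j\|_1 - \|\hat\theta_j\|_1). \]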

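Let $J_1, \ldots, J_k \subset \{1, \ldots, m\}$ be any sets with $J_j \supset \mathrm{spt}(\theta_j)$. Then for each $j \le k$,

\[
\begin{aligned}
\|\hat\theta_j - \theta_j\|_1 &+ (1 + 1/c)(\|\theta_j\|_1 - \|\hat\theta_j\|_1) \\
&= \|\pi_{J_j}\hat\theta_j - \theta_j\|_1 + \|\pi_{J_j^c}\hat\theta_j\|_1 + (1 + 1/c)(\|\theta_j\|_1 - \|\pi_{J_j}\hat\theta_j\|_1 - \|\pi_{J_j^c}\hat\theta_j\|_1) \\
&\le (2 + 1/c)\|\pi_{J_j}\hat\theta_j - \theta_j\|_1 - (1/c)\|\pi_{J_j^c}\hat\theta_j\|_1 \\
&= (K/c)\|\pi_{J_j}\hat\theta_j - \theta_j\|_1 - (1/c)\|\pi_{J_j^c}\hat\theta_j\|_1.
\end{aligned}
\]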
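The chain of inequalities for a single block $j$ relies only on the triangle inequality and on $\mathrm{spt}(\theta_j) \subset J_j$, with $K = 2c + 1$. The following sketch checks it numerically on random vectors. It is an illustration under these stated assumptions; the function names and the random test design are hypothetical.

```python
import random

def l1(v):
    """l1 norm of a list of floats."""
    return sum(abs(x) for x in v)

def step1_sides(theta, theta_hat, J, c):
    """Return (lhs, rhs) of the Step 1 bound for one block, with K = 2c + 1."""
    m = len(theta)
    K = 2.0 * c + 1.0
    on_J = [theta_hat[h] if h in J else 0.0 for h in range(m)]
    off_J = [theta_hat[h] if h not in J else 0.0 for h in range(m)]
    lhs = l1([a - b for a, b in zip(theta_hat, theta)]) \
        + (1.0 + 1.0 / c) * (l1(theta) - l1(theta_hat))
    rhs = (K / c) * l1([a - b for a, b in zip(on_J, theta)]) - (1.0 / c) * l1(off_J)
    return lhs, rhs

def run_trials(seed=0, trials=200, m=12, c=0.75):
    """Check lhs <= rhs for random theta supported on J and arbitrary theta_hat."""
    rng = random.Random(seed)
    for _ in range(trials):
        J = set(rng.sample(range(m), 4))
        theta = [rng.uniform(-2, 2) if h in J else 0.0 for h in range(m)]
        theta_hat = [rng.uniform(-2, 2) for _ in range(m)]
        lhs, rhs = step1_sides(theta, theta_hat, J, c)
        if lhs > rhs + 1e-9:
            return False
    return True
```

Equality can occur, for instance when the coordinates of $\hat\theta_j$ on $J_j$ shrink $\theta_j$ toward zero without changing sign.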
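It follows that w.p. at least $1 - q$,

\[ \sum_{j \le k}\|X(\hat\theta_j - \theta_j)\|_2^2 \le \frac{M_q}{cC_\gamma}\sum_{j \le k}\big(K\|\pi_{J_j}\hat\theta_j - \theta_j\|_1 - \|\pi_{J_j^c}\hat\theta_j\|_1\big). \tag{44} \]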

Fix an instance of $(Y_1, \ldots, Y_N)$ such that (44) holds. Let $A_1, \ldots, A_k \subset \{1, \ldots, m\}$ be sets such that $\mathrm{spt}(\theta_j) \subset A_j$ and $|A_j| = S$. Then (44) holds with $J_j = A_j$. Let

\[ I = \{j \le k : K\|\pi_{A_j}\hat\theta_j - \theta_j\|_1 > \|\pi_{A_j^c}\hat\theta_j\|_1\}. \]

Then $I \ne \emptyset$. We shall consider $j \in I$ and $j \notin I$ separately.

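Before moving to the next step, for each $j \le k$, let $B_j$ be the union of $A_j$ and the indices of the $S$ largest $|\hat\theta_{jh}|$ outside of $A_j$. Then (44) holds with $J_j = B_j$. It is now well-known that [5]

\[ \|\pi_{B_j^c}\hat\theta_j\|_2^2 \le \|\pi_{A_j^c}\hat\theta_j\|_1^2 / S. \tag{45} \]

It is easy to see that for $j \le k$, $\|\pi_{A_j}\hat\theta_j - \theta_j\|_1 \le \|\pi_{B_j}\hat\theta_j - \theta_j\|_1$ and $\|\pi_{A_j^c}\hat\theta_j\|_1 \ge \|\pi_{B_j^c}\hat\theta_j\|_1$.
Step 2. From (44),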

£ - ej)\\l < pL J2(KhA 3 e 3 - 0,11, - IK^ho 
^^E^IK^-^IIi-IKb^IIx). 

7 je/ 

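For each $j \in I$, $K\|\pi_{B_j}\hat\theta_j - \theta_j\|_1 - \|\pi_{B_j^c}\hat\theta_j\|_1 \ge K\|\pi_{A_j}\hat\theta_j - \theta_j\|_1 - \|\pi_{A_j^c}\hat\theta_j\|_1 > 0$, so by Condition 1), $N\kappa^2\|\pi_{J_j}\hat\theta_j - \theta_j\|_2^2 \le \|X(\hat\theta_j - \theta_j)\|_2^2$ holds for $J_j = A_j, B_j$. Letting $J_j = A_j$, from the above display and Cauchy-Schwarz inequality,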

iv« 2 £ 11^0, -^ll^^fE \M °ih 

j&i 7 jei 



\ 1/2 

Eh^A-^i 
giving

\[ \sum_{j \in I}\|\pi_{A_j}\hat\theta_j - \theta_j\|_2^2 \le L_N^2 S|I|. \tag{46} \]

Likewise, letting $J_j = B_j$, it follows that

\[ \sum_{j \in I}\|\pi_{B_j}\hat\theta_j - \theta_j\|_2^2 \le 2L_N^2 S|I|, \tag{47} \]

where the factor 2 is due to $|B_j| = 2S$.

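By (45), (46) and Cauchy-Schwarz inequality, it follows that

\[ \sum_{j \in I}\|\pi_{B_j^c}\hat\theta_j\|_2^2 \le \frac{1}{S}\sum_{j \in I}\|\pi_{A_j^c}\hat\theta_j\|_1^2 \le \frac{K^2}{S}\sum_{j \in I}\|\pi_{A_j}\hat\theta_j - \theta_j\|_1^2 \le K^2\sum_{j \in I}\|\pi_{A_j}\hat\theta_j - \theta_j\|_2^2. \]

Combining the inequality with (46) and (47),

\[ \sum_{j \in I}\|\hat\theta_j - \theta_j\|_2^2 = \sum_{j \in I}\big(\|\pi_{B_j}\hat\theta_j - \theta_j\|_2^2 + \|\pi_{B_j^c}\hat\theta_j\|_2^2\big) \le (2 + K^2)L_N^2 S|I|. \tag{48} \]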

Step 3. We next consider $j \notin I$. The idea is to modify each $\hat\theta_j$ into some $\tilde\theta_j$ that can be dealt with by the argument in Step 2. For $j \notin I$, $K\|\pi_{A_j}\hat\theta_j - \theta_j\|_1 \le \|\pi_{A_j^c}\hat\theta_j\|_1$. Then from (44), we have both

£ ~ ^)Hi < ^ £ IK^ - iHi = W^£ IK^ - ^||i (49) 



and 



Cvv c 

j0i 7 je/ jei 



By Cauchy-Schwartz inequality and (1461) . 

E - e.h < Vs£ h^h - Ojh 

jei jei 

Let $\delta_j = \|\pi_{A_j^c}\hat\theta_j\|_1 - K\|\pi_{A_j}\hat\theta_j - \theta_j\|_1$. Then $\delta_j \ge 0$ for $j \notin I$ and by (50) and (51),

\[ \sum_{j \notin I}\delta_j \le KL_N S|I|. \tag{52} \]

For each j £ I, define 

heA 3 
where $\mathrm{sign}(x) = 1\{x > 0\} - 1\{x < 0\}$ and $e_h$ is the $h$th standard basis vector of $\mathbb{R}^m$. Then for $h \notin A_j$, $\tilde\theta_{jh} = \hat\theta_{jh}$, while for $h \in A_j$,



As a result, for j £ I, 



K\\Tr Aj j -0 j \\ 1 =K\\\n Aj j -9 j \\ 1 



HK^-||i = h A {e 3 \\i, 



heAj 



and consequently $\|\pi_{B_j^c}\tilde\theta_j\|_1 \le K\|\pi_{B_j}\tilde\theta_j - \theta_j\|_1$. Then by Condition 1),



(53) 



On the other hand, by the inequality \\s + t||| < ^ C II 18 II 2 + 11*111) f° r s , t £ Mr, and the inequalities 
in (05) and dHJ 



2 - 0,011! < 2j2(\\x(d j - o^wl + \\x{e s - e s ) 



< 2L NK 2 NY / hA 3 e, -e j \\ 1 + 2j2\\x(e j -dj)\ 

jei jgi 
<2L 2 NK 2 NS\I\ + 2J2\\X& - 

m 



Recall the definition of ox,i- Since |spt(0j — 6j)\ < \Aj\ — S, then 

Eii^-^iii^.sEii^-^ii! 



j&i heAj jgi 



Then by f> 



(54) 



5311^(^-^)111 < 



u x,s 
K 2 S 



E^l <°l,sLlS\lf 



Plug this inequality into ([54]) and combine the result with ([S3)) to get 

2SL 2 N \I\(Nn 2 +a 2 \I\) 



E iK-0i - e 3 -ni < E \\*B,ei - OjWi < 



Nk 2 



Following the derivation of ((48 



£ifo-*iii2< 



2S{l + K 2 )L 2 N \I\{Nn 2 + a^ s \I\) 
Nk 2 



It is easy to see that $\|\hat\theta_j - \theta_j\|_2 \le \|\tilde\theta_j - \theta_j\|_2$ for $j \notin I$. Therefore,

2S(l + K 2 )L 2 N \I\(N K 2 + a 2 x JI\) 



Nk 2 



(55) 



Note that the left-hand side is 0 if $k = 1$. Therefore, we can multiply the right-hand side by $1\{k > 1\}$. Finally, combining (48) and (55), the proof is complete.
5 Hidden variable model



Suppose $(\omega_1, Y_1), \ldots, (\omega_N, Y_N)$ are independent random vectors taking values in $\Omega \times \mathcal{Y}$, and the space can be equipped with product measures $d\mu_i \times d\nu_i$, $i \le N$, that are not necessarily the same, such that each $(\omega_i, Y_i)$ has a joint density with respect to $d\mu_i \times d\nu_i$ as

\[ \Pr\{\omega_i \in dz, Y_i \in dy\} = \frac{g_i(x_i(z)^T\theta)\,k_i(z,y)\,\mu_i(dz)\,\nu_i(dy)}{Z_i(\theta)}, \tag{56} \]

where $k_i$, $g_i$ and $x_i$ are known functions with $x_i : \Omega \to \mathbb{R}^p$, $\theta \in \mathbb{R}^p$ is the true parameter value which is unknown, and for each $u \in \mathbb{R}^p$, $Z_i(u)$ is the normalizing constant

\[ Z_i(u) = \int g_i(x_i(z)^T u)\,k_i(z,y)\,\mu_i(dz)\,\nu_i(dy). \]

Suppose that only $Y_1, \ldots, Y_N$ are observed, while $\omega_1, \ldots, \omega_N$ are hidden. The (log)-likelihood function is then

\[ \ell(u) = \ell(u, Y_1, \ldots, Y_N) = -\sum_{i \le N}\ln\int g_i(x_i(z)^T u)\,k_i(z, Y_i)\,\mu_i(dz) + \sum_{i \le N}\ln Z_i(u). \]

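We next consider the local stochastic Lipschitz continuity of $\ell(u)$ at the true parameter value $\theta$. By

\[ \frac{\int g_i(x_i(z)^T u)\,k_i(z, Y_i)\,\mu_i(dz)}{\int g_i(x_i(z)^T\theta)\,k_i(z, Y_i)\,\mu_i(dz)} = E\left[\frac{g_i(x_i(\omega_i)^T u)}{g_i(x_i(\omega_i)^T\theta)} \,\Big|\, Y_i\right] \]

and

\[ \frac{Z_i(u)}{Z_i(\theta)} = E\left[\frac{g_i(x_i(\omega_i)^T u)}{g_i(x_i(\omega_i)^T\theta)}\right], \]

we have

\[
\begin{aligned}
\ell(u) - \ell(\theta) &= -\sum_{i \le N}\ln\frac{\int g_i(x_i(z)^T u)\,k_i(z, Y_i)\,\mu_i(dz)}{\int g_i(x_i(z)^T\theta)\,k_i(z, Y_i)\,\mu_i(dz)} + \sum_{i \le N}\ln\frac{Z_i(u)}{Z_i(\theta)} \\
&= -\sum_{i \le N}\ln E\left[\frac{g_i(x_i(\omega_i)^T u)}{g_i(x_i(\omega_i)^T\theta)} \,\Big|\, Y_i\right] + \sum_{i \le N}\ln E\left[\frac{g_i(x_i(\omega_i)^T u)}{g_i(x_i(\omega_i)^T\theta)}\right].
\end{aligned}
\]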

Let $D$ be the search domain and suppose it is known that $\theta \in D$. For the tail of

\[ \sup_{u \in D\setminus\{\theta\}}\frac{|\ell(u) - \ell(\theta) - E[\ell(u) - \ell(\theta)]|}{\|u - \theta\|_1}, \]

our analysis is based on the following assumption.
Assumption 2. There is $M_X > 0$, such that w.p. 1,

\[ \|x_i(\omega_i)\|_\infty \le M_X, \quad \text{all } i \le N. \]

For all $i \le N$, $g_i(t)$ is first order differentiable. Moreover, there are $0 < A_g \le B_g < \infty$, $F_1 < \infty$, $F_2 < \infty$, such that

\[ A_g \le g_i(t) \le B_g, \quad |g_i'(t)| \le F_1, \quad |g_i'(t) - g_i'(s)| \le F_2|t - s|, \quad \text{all } i \le N. \]
We need to introduce some constants. Denote

\[ R_D = \sup_{u \in D}\|u - \theta\|_1, \quad I_g = [A_g/B_g, B_g/A_g]. \]

Define for $z > -1$

\[ \varrho(z) = \begin{cases} z^{-1}\ln(1 + z) - 1, & z \ne 0, \\ 0, & z = 0. \end{cases} \]

It is easy to see that $\varrho$ is smooth and strictly decreasing on $(-1, \infty)$ with $\varrho(0) = 0$. Denote

\[ \varrho_0 := \sup_{t \in I_g}|\varrho(t - 1)| < \infty, \quad \varrho_1 := \sup_{t \in I_g}|\varrho'(t - 1)| < \infty. \]

Denote the following constants

\[
\begin{aligned}
&\psi_1 = F_1/A_g, \quad \psi_2 = F_2M_X/(2A_g), \quad \psi_3 = \min(2F_1, F_2M_XR_D/2)/A_g, \\
&\psi_4 = [\psi_1\varrho_0 + \psi_3(1 + \varrho_0)]M_X, \quad \psi_5 = 2\psi_1M_X\varrho_1, \quad \psi_6 = 2(\varrho_0 + \psi_3M_X\varrho_1).
\end{aligned}
\]


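Theorem 5.1. Denote

\[ S_X = \max_{j \le p}\Big(\sum_{i \le N}x_{ij}(\omega_i)^2\Big)^{1/2}, \]

where $x_{ij}(\omega_i)$ is the $j$th coordinate of $x_i(\omega_i)$. Under Assumption 2, for any $q_0, q_1 \in (0,1)$ with $q_0 + q_1 < 1$, w.p. at least $1 - q_0 - q_1$,

\[
\begin{aligned}
\sup_{u \in D\setminus\{\theta\}}\frac{|\ell(u) - \ell(\theta) - E[\ell(u) - \ell(\theta)]|}{\|u - \theta\|_1} \le{}& 2\sqrt{2}R_D\,ES_X\big(\Lambda\sqrt{\ln(2p)} + 2\psi_3(\psi_5 + \psi_6)\big) \\
&+ \sqrt{2N}\big(\psi_1M_X\sqrt{\ln(2p/q_0)} + 2\psi_4\sqrt{\ln(p/q_1)}\big),
\end{aligned}
\]

where $\Lambda = 2\psi_2(1 + \psi_6) + (\psi_1 + 2\psi_2R_D + 2\psi_3)(\psi_5 + \psi_6)$.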

To see how Theorem 5.1 may be used, consider the following Lasso type estimator

\[ \hat\theta = \arg\min_{u \in D}\{\ell(u) + \lambda\|u\|_1\}, \tag{57} \]

where $\lambda > 0$ is a tuning parameter. The next result is in the same spirit as Theorem 4.1 and actually simpler, as no design matrices are involved. Furthermore, it holds in a more general setting than the hidden variable case.
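The text does not say how a minimizer of (57) would be computed. As a rough illustration only, the sketch below uses proximal gradient descent (ISTA) with soft-thresholding, a standard technique swapped in here and not taken from the paper. It assumes that $\ell$ is differentiable with a Lipschitz gradient supplied by the caller, and it ignores the constraint $u \in D$. The names `soft_threshold`, `ista`, `grad` and `step` are hypothetical.

```python
def soft_threshold(x, tau):
    """Proximal map of tau * |x|: shrink x toward 0 by tau."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0

def ista(grad, u0, lam, step, iters=500):
    """Proximal gradient iterations for minimizing l(u) + lam * ||u||_1.

    grad(u) must return the gradient of the smooth part l at u; the step
    size should not exceed the reciprocal of the gradient's Lipschitz constant.
    """
    u = list(u0)
    for _ in range(iters):
        g = grad(u)
        u = [soft_threshold(ui - step * gi, step * lam) for ui, gi in zip(u, g)]
    return u

# Toy smooth part l(u) = 0.5 * ||u - b||^2, whose gradient is u - b.
b = [3.0, -0.5, 1.2]
solution = ista(lambda u: [ui - bi for ui, bi in zip(u, b)], [0.0, 0.0, 0.0],
                lam=1.0, step=0.5, iters=200)
```

For this toy quadratic the minimizer is the coordinatewise soft-thresholding of `b` at level `lam`, which gives a simple check of the iteration.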

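Proposition 5.2. Let $\ell(u)$ be a stochastic process indexed by $u \in D \subset \mathbb{R}^p$. Fix $\theta \in D$. Let $S := |\mathrm{spt}(\theta)| < p/2$ and $q \in (0,1)$. Suppose the following two conditions are satisfied.

1) There is a constant $C_\ell > 0$, such that $E\ell(u) - E\ell(\theta) \ge C_\ell\|u - \theta\|_2^2$ for $u \in D$.

2) There is $M_q > 0$, such that

\[ \Pr\left\{\sup_{u \in D\setminus\{\theta\}}\frac{|\ell(u) - \ell(\theta) - E[\ell(u) - \ell(\theta)]|}{\|u - \theta\|_1} > M_q\right\} \le q. \]

Given $K > 1$, let $\lambda = KM_q$ in (57). Then w.p. at least $1 - q$,

\[ \|\hat\theta - \theta\|_2 \le \sqrt{2 + \frac{(K+1)^2}{(K-1)^2}}\;\frac{(K+1)M_q\sqrt{S}}{C_\ell}. \]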
Proposition 5.2 requires two conditions. On the one hand, Theorem 5.1 can be used to derive Condition 2). On the other, Condition 1) requires extra assumptions to establish. In the context of hidden variables, since

\[ E[\ell(u) - \ell(\theta)] = -\sum_{i \le N}E\ln E\left[\frac{g_i(x_i(\omega_i)^T u)}{g_i(x_i(\omega_i)^T\theta)} \,\Big|\, Y_i\right] + \sum_{i \le N}\ln E\left[\frac{g_i(x_i(\omega_i)^T u)}{g_i(x_i(\omega_i)^T\theta)}\right], \tag{58} \]

we need some assumptions on the structure of $(\omega_i, Y_i)$. We next consider a case in which both $\omega_i$ and $Y_i$ are processes. One could have a hidden Markov model in mind, with $\omega_i$ the hidden Markov process and $Y_i$ the observations.

Suppose $(\omega_i, Y_i)$ are i.i.d. and for each $i \le N$, $\omega_i = (\omega_{i1}, \ldots, \omega_{in})$ and $Y_i = (Y_{i1}, \ldots, Y_{in})$ are jointly distributed processes, such that all $\omega_{it}$ take values in a common alphabet $A = \{1, \ldots, 1 + L\}$, and conditioning on $\omega_i$, $Y_{i1}, \ldots, Y_{in}$ are independent with $Y_{it} \sim N(\omega_{it}, \sigma^2)$. Suppose $\omega_i$ follows a tilted version of a baseline distribution $\pi_0(z)$

\[ \pi(z \mid \theta) \propto \pi_0(z)\exp\Big\{\sum_{t \le n}\sum_{a \le L}\theta_{ta}1\{z_t = a\}\Big\}, \quad z = (z_1, \ldots, z_n) \in A^n. \]

Note that while $A$ has $L + 1$ different letters, to ensure the identifiability of $\theta_{ta}$, only $L$ parameters are associated with each $t \le n$.
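The tilted distribution can be made concrete by brute-force enumeration of $A^n$ for small $n$ and $L$. The sketch below is an illustration under that assumption (enumeration is exponential in $n$); the baseline `pi0` is passed in as an unnormalized function, and the names `feature` and `tilted` are hypothetical. The feature vector has a 1 in position $(t-1)L + a$ when $z_t = a \le L$, so the letter $L + 1$ contributes nothing to the exponent.

```python
import itertools
import math

def feature(z, L):
    """Indicator vector of length n*L: entry (t-1)L + a (1-based) is 1{z_t = a}, a <= L."""
    x = [0] * (len(z) * L)
    for t, letter in enumerate(z):
        if letter <= L:
            x[t * L + (letter - 1)] = 1
    return x

def tilted(pi0, theta, n, L):
    """Return {z: pi(z | theta)} over A^n with A = {1, ..., L+1}, by enumeration."""
    weights = {}
    for z in itertools.product(range(1, L + 2), repeat=n):
        score = sum(xi * th for xi, th in zip(feature(z, L), theta))
        weights[z] = pi0(z) * math.exp(score)
    total = sum(weights.values())
    return {z: w / total for z, w in weights.items()}

# Uniform baseline on sequences of length 2 over a two-letter alphabet (L = 1).
probs = tilted(lambda z: 1.0, [math.log(2.0), 0.0], n=2, L=1)
```

With a uniform baseline and a single nonzero parameter $\ln 2$ at position $t = 1$, the letter 1 at the first position becomes twice as likely as the letter 2.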

Let $\mu_i = \pi_0$ and $\nu_i$ the Lebesgue measure on $\mathbb{R}^n$. For $i \le N$, let $k_i(z, y) = \prod_{t \le n}f((z_t - y_t)/\sigma)$ with $f$ the density of $N(0, 1)$, $x_i(z) \in \{0, 1\}^{nL}$ with the $((t-1)L + a)$-th entry equal to $1\{z_t = a\}$, and $\theta = (\theta_{11}, \ldots, \theta_{1L}, \ldots, \theta_{n1}, \ldots, \theta_{nL})$. Finally, let $g_i(x) = e^x$. Then the above model can be formulated as in (56). Denote $X_i = x_i(\omega_i)$. By (58),

\[ E[\ell(u) - \ell(\theta)] = N\big[\ln Ee^{X_1^T(u - \theta)} - E\ln E(e^{X_1^T(u - \theta)} \mid Y_1)\big]. \]

Suppose that it is known that $\theta \in D$, where $D \subset \mathbb{R}^{nL}$ is a bounded set. As discussed earlier, the concern here is Condition 1) in Proposition 5.2. We can make the following assertion.

Proposition 5.3. Suppose $\pi_0(z) > 0$ for all $z \in A^n$. Then Condition 1) of Proposition 5.2 is satisfied.



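5.1 Proof of Theorem 5.1

For notational ease, denote

\[ v = u - \theta, \quad X_i = x_i(\omega_i), \quad E_i(\cdot) = E(\cdot \mid Y_i) \]

and

\[ \gamma_i(v) = E_i\left[\frac{g_i(X_i^T(\theta + v))}{g_i(X_i^T\theta)}\right] - 1, \quad \bar\gamma_i(v) = E\left[\frac{g_i(X_i^T(\theta + v))}{g_i(X_i^T\theta)}\right] - 1. \]

Note that, because $\theta$ is fixed even though unknown, $\gamma_i(v)$ is a random function only dependent on $v$ and $Y_i$, while $\bar\gamma_i(v)$ is a nonrandom function only dependent on $v$. Then

\[ \ell(u) - \ell(\theta) = -\sum_{i \le N}\gamma_i(v)\big(\varrho(\gamma_i(v)) + 1\big) + \sum_{i \le N}\bar\gamma_i(v)\big(\varrho(\bar\gamma_i(v)) + 1\big). \tag{59} \]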



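Define functions

\[ \lambda_i(s) = (\ln g_i)'(s) = \frac{g_i'(s)}{g_i(s)}, \quad \varphi_i(s, t) = \begin{cases} \dfrac{g_i(s + t) - g_i(s)}{g_i(s)t} - \lambda_i(s), & t \ne 0, \\ 0, & t = 0, \end{cases} \tag{60} \]

of $s, t \in \mathbb{R}$. Define $z_i = (z_{i1}, \ldots, z_{ip})$ and $s_i(v) = (s_{i1}(v), \ldots, s_{ip}(v))$ with

\[ z_{ih} = E_i[\lambda_i(X_i^T\theta)X_{ih}], \quad s_{ih}(v) = E_i[\varphi_i(X_i^T\theta, X_i^Tv)X_{ih}], \quad i \le N,\ h \le p. \tag{61} \]

Note that each $z_{ih}$ is a function only in $Y_i$, and each $s_{ih}(v)$ is a function only in $v$ and $Y_i$. Then

\[ \gamma_i(v) = E_i[\lambda_i(X_i^T\theta)X_i + \varphi_i(X_i^T\theta, X_i^Tv)X_i]^Tv = (z_i + s_i(v))^Tv, \tag{62} \]

which combined with (59) yields

\[ \ell(u) - \ell(\theta) - E[\ell(u) - \ell(\theta)] = -\sum_{i \le N}\Big\{(1 + \varrho(\gamma_i(v)))(z_i + s_i(v)) - E\big[(1 + \varrho(\gamma_i(v)))(z_i + s_i(v))\big]\Big\}^Tv. \]

For $i \le N$ and $h \le p$, write $\zeta_{ih}(v) = s_{ih}(v) + z_{ih}\varrho(\gamma_i(v)) + s_{ih}(v)\varrho(\gamma_i(v))$ so that the above equation can be written as

\[ \ell(u) - \ell(\theta) - E[\ell(u) - \ell(\theta)] = -\sum_{h \le p}\Big(\sum_{i \le N}(z_{ih} - Ez_{ih})\Big)v_h - \sum_{h \le p}\Big(\sum_{i \le N}(\zeta_{ih}(v) - E\zeta_{ih}(v))\Big)v_h. \]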



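Define for $h \le p$,

\[ W_h = \sup_{u \in D}\Big|\sum_{i \le N}(\zeta_{ih}(v) - E\zeta_{ih}(v))\Big|. \]

Then

\[ \sup_{u \in D\setminus\{\theta\}}\frac{|\ell(u) - \ell(\theta) - E[\ell(u) - \ell(\theta)]|}{\|u - \theta\|_1} \le \max_{h \le p}\Big|\sum_{i \le N}(z_{ih} - Ez_{ih})\Big| + \max_{h \le p}W_h. \tag{63} \]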



Lemma 5.4. (1) W.p. 1, the following inequalities hold simultaneously,

\K(xJe)\<^, ?®ge/,. 

(2) $\varphi_i(s, 0) = 0$ and w.p. 1, for all $s \in \mathbb{R}$, $i \le N$ and $h \le p$, $\varphi_i(s, \cdot)X_{ih}$ is $\psi_2$-Lipschitz and $|\varphi_i(s, X_i^T(u - \theta))| \le \psi_3$ for all $u \in D$.

Given $h \le p$, from Lemma 5.4, w.p. 1, for all $i \le N$, $|z_{ih}| \le \psi_1M_X$. Then by union-sum inequality and Hoeffding inequality,

\[ \Pr\Big\{\max_{h \le p}\Big|\sum_{i \le N}(z_{ih} - Ez_{ih})\Big| > \sqrt{N}\psi_1M_Xt\Big\} \le \sum_{h \le p}\Pr\Big\{\Big|\sum_{i \le N}(z_{ih} - Ez_{ih})\Big| > \sqrt{N}\psi_1M_Xt\Big\} \le 2pe^{-t^2/2}. \]

Letting $t = \sqrt{2\ln(2p/q_0)}$ then yields the following bound on the first term in (63),

\[ \Pr\Big\{\max_{h \le p}\Big|\sum_{i \le N}(z_{ih} - Ez_{ih})\Big| > \psi_1M_X\sqrt{2N\ln(2p/q_0)}\Big\} \le q_0. \tag{64} \]
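The choice of $t$ leading to (64) can be checked with a few lines of arithmetic: the union bound $2pe^{-t^2/2}$ equals $q_0$ exactly when $t = \sqrt{2\ln(2p/q_0)}$, and the corresponding level is $\sqrt{N}\psi_1M_Xt$. The sketch below is only a numerical restatement of that step; the function names and sample values are hypothetical.

```python
import math

def hoeffding_union_tail(t, p):
    """Union bound 2 p exp(-t^2 / 2) over p coordinates."""
    return 2.0 * p * math.exp(-0.5 * t * t)

def calibrated_t(p, q0):
    """Value of t at which the union bound equals q0."""
    return math.sqrt(2.0 * math.log(2.0 * p / q0))

def threshold(psi1, M_X, N, p, q0):
    """Level psi1 * M_X * sqrt(2 N ln(2p/q0)) appearing in (64)."""
    return psi1 * M_X * math.sqrt(2.0 * N * math.log(2.0 * p / q0))
```

Note that the level grows like $\sqrt{N\ln p}$, so the dimension $p$ enters only logarithmically.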



Given $h$, from Lemma 5.4, w.p. 1, for all $i \le N$ and $v = u - \theta$ with $u \in D$, $|\zeta_{ih}(v)| \le \psi_4$, so $-\psi_4 - E\zeta_{ih}(v) \le \zeta_{ih}(v) - E\zeta_{ih}(v) \le \psi_4 - E\zeta_{ih}(v)$. Then by Lemma 3.3, inequality (19),

\[ \Pr\big\{W_h > EW_h + 2\psi_4\sqrt{2Ns}\big\} \le e^{-s}, \quad s > 0. \]

By union-sum inequality, it follows that

\[ \Pr\Big\{\max_{h \le p}W_h > \max_{h \le p}EW_h + 2\psi_4\sqrt{2N\ln(p/q_1)}\Big\} \le q_1. \tag{65} \]

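We need to bound $EW_h$ for each $h \le p$. Since $\zeta_{ih}(v)$ are continuous in $v$ and bounded, by dominated convergence argument, we can apply symmetrization ([14], Lemma 6.3) to get

\[ \frac{EW_h}{2} \le E\sup_{u \in D}\Big|\sum_{i \le N}\varepsilon_i\zeta_{ih}(v)\Big| \le E\sup_{u \in D}\Big|\sum_{i \le N}\varepsilon_is_{ih}(v)\Big| + I_h^{(1)} + I_h^{(2)}, \tag{66} \]

where $\varepsilon_1, \ldots, \varepsilon_N$ are i.i.d. Rademacher variables independent of $(\omega_i, Y_i)$, and

\[ I_h^{(1)} = E\sup_{u \in D}\Big|\sum_{i \le N}\varepsilon_iz_{ih}\varrho(\gamma_i(v))\Big|, \quad I_h^{(2)} = E\sup_{u \in D}\Big|\sum_{i \le N}\varepsilon_is_{ih}(v)\varrho(\gamma_i(v))\Big|. \]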

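Define

\[ U_h = \sup_{u \in D}\Big|\sum_{i \le N}\varepsilon_i\varphi_i(X_i^T\theta, X_i^Tv)X_{ih}\Big|, \quad Q = \sup_{u \in D}\Big|\sum_{i \le N}\varepsilon_i\gamma_i(u - \theta)\Big|. \]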

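To continue, we need the following result.
Lemma 5.5. For each $h \le p$,

\[ E\sup_{u \in D}\Big|\sum_{i \le N}\varepsilon_is_{ih}(v)\Big| \le EU_h \le 2\sqrt{2\ln(2p)}\,\psi_2R_D\,ES_X. \tag{67} \]

Furthermore, let

\[ M_U = 2\sqrt{2}\big[\sqrt{\ln(2p)}\,\psi_2R_D + \psi_3(\sqrt{\ln p} + 1)\big]ES_X. \]

Then

\[ E\max_{h \le p}U_h \le M_U, \quad EQ \le R_D\big(\sqrt{2\ln(2p)}\,\psi_1ES_X + M_U\big). \]

To bound $I_h^{(1)}$, by Fubini theorem,

\[ I_h^{(1)} = E_YE_\varepsilon\sup_{u \in D}\Big|\sum_{i \le N}\varepsilon_iz_{ih}\varrho(\gamma_i(v))\Big|. \]

Given $Y_1, \ldots, Y_N$, $z_{ih}$ are fixed and $\gamma_i(v)$ become nonrandom functions in $v = u - \theta$. By Lemma 5.4, the nonrandom function $t \mapsto z_{ih}\varrho(t)$ maps 0 to 0 and is $(\psi_5/2)$-Lipschitz. Then by Lemma 3.4,

\[ I_h^{(1)} \le \psi_5E_YE_\varepsilon\sup_{u \in D}\Big|\sum_{i \le N}\varepsilon_i\gamma_i(v)\Big| = \psi_5EQ. \tag{68} \]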

To bound L h = sup ue£) ^2 i<N £iSih{v)g{^i{v)) , we have to use the multivariate comparison 

results in Section [21 Given Y\,...,Ypf, both Sih(v) and ji(v) with v = u — are nonrandom 
functions of u £ D. Let <?(s, i) = sg(t) for i < N and 



T = {t = (*i,..., tjv) :*< = (»i&(»),7i(w)), «<^> v = u-6, ueD}. 
Then



E E sup 



^e l s lh {v)g{^ i {v)) 



■<N 



E £ sup 
teT 



By Lemma IST41 w.p. 1, for all i < N, h < p, and u g D, (sih(v) , ji(v)) G J, where 

J = {{s,t-1) : \s\ <tp 3 M x , telg}. 
Lemma 5.6. g(s,t) is (ipQ/2,£ oc )-Lipschitz on J. Furthermore, \g(s,t)\ < (^jq/2) min(|s|, \t\ 
From Lemma 15.61 and Theorem 12.11 



E e sup 
teT 



r 

Integrating over Yi, . . . , Y/\r, we thus get 



E 

i<Af 



< -06 I E e sup 
teT 



i<N 



E e sup 
teT 



;2 



if < A ( E sup 



!<JV 



E sup 



■<N 



<ip 6 E(Q + U h 



(69) 



where the second inequality is due to (|6"7|) . Combine (j6"5)l . (JHSJ) and Lemmas 15. 51 
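\[
\begin{aligned}
\frac{EW_h}{2} &\le (1 + \psi_6)EU_h + (\psi_5 + \psi_6)EQ \\
&\le (1 + \psi_6)2\sqrt{2\ln(2p)}\,\psi_2R_D\,ES_X + (\psi_5 + \psi_6)R_D\big(\sqrt{2\ln(2p)}\,\psi_1ES_X + M_U\big).
\end{aligned}
\]

Note that the bound holds for all $h \le p$. Incorporate the bound into (65). Together with (63) and (64), this finishes the proof.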



5.2 Proof of Lemma 5.5

To prove the Lemma, we need the following result.
Lemma 5.7. Suppose $\xi \ge 0$ such that for some $a, b, c \ge 0$ and $d \ge 1$,

\[ \Pr\{\xi > a + b\sqrt{s} + cs\} \le de^{-s}, \quad s > 0. \tag{70} \]

Then $E\xi \le a + b(\sqrt{\ln d} + 1) + c(\ln d + 1)$.

Proof. First, if $b = c = 0$, then $\Pr\{\xi > a\} \le de^{-s}$ for any $s > 0$. Let $s \to \infty$ to get $\xi \le a$ and hence $E\xi \le a + b(\sqrt{\ln d} + 1) + c(\ln d + 1)$. Assume $b + c > 0$. Condition (70) implies

\[ \Pr\big\{\xi > a + b\sqrt{s + \ln d} + c(s + \ln d)\big\} \le e^{-s}, \quad s > 0. \]

By $\sqrt{s + \ln d} \le \sqrt{s} + \sqrt{\ln d}$,

\[ \Pr\{\xi > a_0 + f(s)\} \le e^{-s}, \quad s > 0, \]

where $a_0 = a + b\sqrt{\ln d} + c\ln d$ and $f(s) = b\sqrt{s} + cs$ is a 1-to-1 and onto mapping $[0, \infty) \to [0, \infty)$. Let $f^{-1}$ be the inverse of $f$. Then

\[
\begin{aligned}
E\xi &= \int_0^{a_0}\Pr\{\xi > t\}\,dt + \int_0^\infty\Pr\{\xi > a_0 + t\}\,dt \le a_0 + \int_0^\infty\exp\{-f^{-1}(t)\}\,dt \\
&= a_0 + \int_0^\infty e^{-s}f'(s)\,ds = a_0 + \int_0^\infty e^{-s}(bs^{-1/2}/2 + c)\,ds \le a_0 + b + c,
\end{aligned}
\]

which completes the proof.



□ 
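Lemma 5.7 can be checked numerically on a variable whose tail meets the condition with equality. Assume, for illustration only, that $\xi = a + b\sqrt{\ln d + E} + c(\ln d + E)$ with $E$ a standard exponential variable; then $\Pr\{\xi > a + b\sqrt{s} + cs\} = \min(1, de^{-s})$. The sketch below computes $E\xi$ by the midpoint rule and compares it with the bound of the lemma. The construction of $\xi$, the function names and the integration parameters are hypothetical.

```python
import math

def expected_xi(a, b, c, d, upper=60.0, steps=200000):
    """E[xi] for xi = a + b*sqrt(ln d + E) + c*(ln d + E), E ~ Exp(1), midpoint rule."""
    h = upper / steps
    log_d = math.log(d)
    total = 0.0
    for i in range(steps):
        e = (i + 0.5) * h
        xi = a + b * math.sqrt(log_d + e) + c * (log_d + e)
        total += xi * math.exp(-e) * h
    return total

def lemma_bound(a, b, c, d):
    """Right-hand side a + b(sqrt(ln d) + 1) + c(ln d + 1) of Lemma 5.7."""
    log_d = math.log(d)
    return a + b * (math.sqrt(log_d) + 1.0) + c * (log_d + 1.0)
```

The gap between the two quantities comes only from the $b$ term, since $E\sqrt{\ln d + E} \le \sqrt{\ln d} + \sqrt{\pi}/2$, while the $c$ term is matched exactly by this construction.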
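Proof of Lemma 5.5. By independence and Jensen inequality,

\[ \sup_{u \in D}\Big|\sum_{i \le N}\varepsilon_is_{ih}(v)\Big| = \sup_{u \in D}\Big|E\Big[\sum_{i \le N}\varepsilon_i\varphi_i(X_i^T\theta, X_i^Tv)X_{ih} \,\Big|\, \varepsilon_i, Y_i, i \le N\Big]\Big| \le E[U_h \mid \varepsilon_i, Y_i, i \le N]. \]

Integrating over $Y_1, \ldots, Y_N$ and $\varepsilon_1, \ldots, \varepsilon_N$ leads to the first inequality in (67). Given $X_1, \ldots, X_N$, by Lemma 5.4, $t \mapsto \varphi_i(X_i^T\theta, t)X_{ih}$ maps 0 to 0 and is $\psi_2$-Lipschitz. Therefore, by Lemma 3.4, Hölder inequality, and Lemma 3.5,

\[
\begin{aligned}
E_\varepsilon U_h &\le 2\psi_2E_\varepsilon\sup_{u \in D}\Big|\sum_{i \le N}\varepsilon_iX_i^Tv\Big| \le 2\psi_2R_DE_\varepsilon\max_{j \le p}\Big|\sum_{i \le N}\varepsilon_iX_{ij}\Big| \\
&\le 2\psi_2R_D\max_{j \le p}\Big(\sum_{i \le N}X_{ij}^2\Big)^{1/2} \times \sqrt{2\ln(2p)} = 2\sqrt{2\ln(2p)}\,\psi_2R_DS_X.
\end{aligned}
\]

Integrating over $X_1, \ldots, X_N$ yields the second inequality in (67).

On the other hand, given $X_1, \ldots, X_N$, by Lemma 5.4, $|\varphi_i(X_i^T\theta, X_i^Tv)| \le \psi_3$ for each $u \in D$. Since $\varepsilon_1, \ldots, \varepsilon_N$ are i.i.d. Rademacher variables independent of $X_1, \ldots, X_N$, then by Lemma 3.3, inequality (19), for each $h \le p$,

\[ \Pr\big\{U_h > E_\varepsilon U_h + 2\psi_3S_X\sqrt{2s} \,\big|\, X_1, \ldots, X_N\big\} \le e^{-s}. \tag{71} \]

Incorporate the bound on $E_\varepsilon U_h$ into (71) and apply union-sum inequality to get

\[ \Pr\Big\{\max_{h \le p}U_h > 2\sqrt{2\ln(2p)}\,\psi_2R_DS_X + 2\sqrt{2s}\,\psi_3S_X \,\Big|\, X_1, \ldots, X_N\Big\} \le pe^{-s}. \]

By Lemma 5.7, we get

\[ E\Big[\max_{h \le p}U_h \,\Big|\, X_1, \ldots, X_N\Big] \le 2\sqrt{2\ln(2p)}\,\psi_2R_DS_X + 2\sqrt{2}\,\psi_3S_X(\sqrt{\ln p} + 1). \]

Integrating over $X_1, \ldots, X_N$ yields the bound on $E\max_{h \le p}U_h$.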
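Finally, by (62) and $\|u - \theta\|_1 \le R_D$ for all $u \in D$,

\[ Q = \sup_{u \in D}\Big|\sum_{h \le p}\sum_{i \le N}\varepsilon_i(z_{ih} + s_{ih}(v))v_h\Big| \le R_D\sup_{u \in D}\max_{h \le p}\Big|\sum_{i \le N}\varepsilon_i(z_{ih} + s_{ih}(v))\Big|. \]

Following the proof for the first inequality in (67),

\[ EQ \le R_D\Big(E\max_{h \le p}\Big|\sum_{i \le N}\varepsilon_i\lambda_i(X_i^T\theta)X_{ih}\Big| + E\max_{h \le p}U_h\Big). \]

Given $X_1, \ldots, X_N$, by Lemma 5.4, $\sum_{i \le N}(\lambda_i(X_i^T\theta)X_{ih})^2 \le \psi_1^2S_X^2$ for all $h \le p$. Therefore, by Lemma 3.5 and Fubini theorem,

\[ E\max_{h \le p}\Big|\sum_{i \le N}\varepsilon_i\lambda_i(X_i^T\theta)X_{ih}\Big| \le \sqrt{2\ln(2p)}\,\psi_1ES_X, \]

which together with the bound on $E\max_{h \le p}U_h$ yields the desired bound on $EQ$.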



□ 
Appendix: miscellaneous proofs

In this section, we collect proofs for the lemmas and propositions in the main text. 

Proof of Lemma 3.3. First, assume $D$ is finite. Let $D_+ = D \times \{1\}$ and $D_- = D \times \{-1\}$ and denote $T = D_+ \cup D_-$. The random vectors $X_i = (af_i(u), (u, a) \in T)$, $i \le N$, are independent taking values in $\mathbb{R}^T$. It is easy to check that

\[ W = \max\Big\{\sup_{u \in D}\sum_{i \le N}f_i(u),\ \sup_{u \in D}\sum_{i \le N}(-f_i(u))\Big\} = \sup_{t \in T}\sum_{i \le N}X_{i,t}, \]

where $X_{i,t}$ denotes the $t$-th coordinate of $X_i$. Since $a_i \le X_{i,t} \le b_i$ for $t = (u, 1)$ and $-b_i \le X_{i,t} \le -a_i$ for $t = (u, -1)$, by Theorem 9 of [15], letting $L^2 = \sum_{i \le N}(b_i - a_i)^2$,

\[ \Pr\{W > EW + x\} \le \exp\Big\{-\frac{x^2}{2L^2}\Big\}, \quad x > 0, \]

which implies (19).

Still assuming $D$ is finite, assume moreover that $Ef_i(u) = 0$ for all $i \le N$ and $u \in D$. For each $t = (u, a) \in T$, define $s^t = (s_1^t, \ldots, s_N^t)$, with each $s_i^t$ being the map $x \mapsto x_t/M$. Then w.p. 1, $s_i^t(X_i) = af_i(u)/M \in [-1, 1]$ with mean 0 for $i \le N$ and

\[ \sup_{t \in T}\mathrm{Var}\Big(\sum_{i \le N}s_i^t(X_i)\Big) = \sup_{(u,a) \in T}\mathrm{Var}\Big(\sum_{i \le N}af_i(u)/M\Big) = \sup_{u \in D}\mathrm{Var}\Big(\sum_{i \le N}f_i(u)/M\Big) \le (S/M)^2. \]

We next apply Theorem 1.1 of [33] to $W' = \sup_{t \in T}\sum_{i \le N}s_i^t(X_i)$. Let $w = 2EW' + (S/M)^2$. Then for any $a > 0$,

\[ \Pr\{W' > EW' + a\} \le \exp\Big\{-\frac{a^2}{2w + 3a}\Big\}. \]

For any $s > 0$, $a = (3s + \sqrt{9s^2 + 8ws})/2$ is the unique positive solution to $a^2/(2w + 3a) = s$. By using $\sqrt{x + y} \le \sqrt{x} + \sqrt{y}$ and $2\sqrt{xy} \le x + y$ for $x, y \ge 0$, it is seen that $a \le EW' + (S/M)\sqrt{2s} + 4s$. So

\[ \Pr\big\{W' > 2EW' + (S/M)\sqrt{2s} + 4s\big\} \le e^{-s}. \]

Since $W' = W/M$, the second inequality of the lemma then follows.

For an arbitrary $D$, by the path continuity of $f_i$, $W = \lim_n\sup_{u \in D_n}|\sum_{i \le N}f_i(u)|$, with $D_1 \subset D_2 \subset \cdots$ being a (nonrandom) sequence of finite subsets of $D$. Then the proof is complete by monotone convergence. □
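The algebra behind the choice of $a$ in the proof above can be verified numerically. The sketch below computes the positive root of $a^2/(2w + 3a) = s$ and checks the upper bound $a \le m + v\sqrt{2s} + 4s$ when $w = 2m + v^2$, where $m \ge 0$ stands for the expected supremum and $v \ge 0$ for the ratio $S/M$. The parameter names `m` and `v` and the function names are hypothetical.

```python
import math

def bernstein_level(s, w):
    """Unique positive root a of a^2 / (2w + 3a) = s, for s > 0 and w >= 0."""
    return 0.5 * (3.0 * s + math.sqrt(9.0 * s * s + 8.0 * w * s))

def simplified_level(s, m, v):
    """Upper bound m + v*sqrt(2s) + 4s on the root when w = 2m + v^2."""
    return m + v * math.sqrt(2.0 * s) + 4.0 * s

def check(s, m, v):
    """Return True if the root solves the equation and obeys the simplified bound."""
    w = 2.0 * m + v * v
    a = bernstein_level(s, w)
    solves = abs(a * a / (2.0 * w + 3.0 * a) - s) <= 1e-9 * max(1.0, s)
    return solves and a <= simplified_level(s, m, v) + 1e-12
```

The simplified level trades a slightly larger constant for a form that is linear in the expected supremum, which is what makes the final tail bound easy to apply.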

Proof of Lemma \3.b\ Denote /j(-) = 7 i (-,y i ). For any seK', whether or not s 3 ■ = 0, 

fij( s ) = / (dj.fi(ci + ttj-iS + Sjuej) - djfi(ci)) du, 
Jo 

where ej is the jth standard basis vector of M fc . It follows that tfij(0) — and for s £ R fc , 
\<Pij(s)\ < / \djfi(ci + Ttj-xs + Sjuej) - djfi{ci)\ du < 2F 1 . 



Furthermore, for t £ 



\<fiij(s) - <Pij(t)\ < / \djfi(ci + n^is + Sjuej) ~ djfiict + Kj-\t + tjUBj)] du. 
Jo 

Since djf% is (F 2 , ^oo)-Lipschitz, the integrated function on the right hand side is no greater than 
-f^ 1 1 7f j — 1 ( -s — t) + (sj — t^uejWoo < F2\\s — t\\oo, and so the integral is no greater than F 2 \\s — t\\oo, 
proving ipij is (i 7 ^, ^oo)-Lipschitz. In particular, letting s = and t = Zi(u — 6) gives |y-y(i)| < 
F 2 \\Z i (u-0)\\ oo = F 2 max j \Zj j (u-0)\<F 2 M z \\u-0\\ 1 <F 2 M z R D . □ 



The proof of Lemma 13.71 is similar to Lemma 

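Proof of Proposition 5.2. The proof is more or less standard (cf. [4]), so we will be brief. By definition of $\hat\theta$ and the assumptions of Proposition 5.2, w.p. at least $1 - q$,

\[
\begin{aligned}
C_\ell\|\hat\theta - \theta\|_2^2 \le E\ell(\hat\theta) - E\ell(\theta) &\le \{\ell(\theta) - E\ell(\theta)\} - \{\ell(\hat\theta) - E\ell(\hat\theta)\} - \lambda(\|\hat\theta\|_1 - \|\theta\|_1) \\
&\le M_q\|\hat\theta - \theta\|_1 - KM_q(\|\hat\theta\|_1 - \|\theta\|_1).
\end{aligned}
\]

Let $A = \mathrm{spt}(\theta)$ and $B$ the union of $A$ and the indices corresponding to the $S$ largest $|\hat\theta_h|$ outside of $A$. Let $a = M_q/C_\ell$. Then by the above inequality, for $J = A, B$,

\[ \|\pi_J\hat\theta - \theta\|_2^2 + \|\pi_{J^c}\hat\theta\|_2^2 \le (K + 1)a\|\pi_J\hat\theta - \theta\|_1 - (K - 1)a\|\pi_{J^c}\hat\theta\|_1, \]

giving

\[ \|\pi_J\hat\theta - \theta\|_2 \le (K + 1)a\sqrt{|J|}, \quad \|\pi_{J^c}\hat\theta\|_1 \le c\|\pi_J\hat\theta - \theta\|_1, \]

with $c = (K + 1)/(K - 1)$. Since $\|\pi_{B^c}\hat\theta\|_2^2 \le \|\pi_{A^c}\hat\theta\|_1^2/S$ (cf. [5]), the second inequality in the display gives $\|\pi_{B^c}\hat\theta\|_2^2 \le c^2\|\pi_A\hat\theta - \theta\|_2^2$. The proof follows by combining this and the first inequality in the display, applied to $A$ and $B$, respectively. □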



Proof of Proposition 5.3. Recall $X_1$ has $nL$ entries, such that for $t \le n$ and $a \le L$, the $((t-1)L + a)$-th entry is $1\{\omega_{1t} = a\}$. Since $\|X_1\|_\infty \le 1$, it is not hard to see that $E\ell(u)$ is continuously differentiable in $u$ and its Hessian at $\theta$ is $H_N = NH$, where

\[ H = \mathrm{Var}(X_1) - E(\mathrm{Var}(X_1 \mid Y_1)) = \mathrm{Var}(E(X_1 \mid Y_1)). \]

First, we show that for any $u \ne \theta$, $E[\ell(u)] > E[\ell(\theta)]$. By Jensen inequality, $E[\ell(u)] \ge E[\ell(\theta)]$ with equality if and only if there is a constant $c = c(u)$ such that for all $y \in \mathbb{R}^n$, $E(e^{X_1^T(u - \theta)} \mid Y_1 = y) = c$. If for some $u$ the equality holds, then

\[ \sum_{z \in A^n}e^{x(z)^T(u - \theta)}\prod_{t \le n}e^{-(z_t - y_t)^2/2\sigma^2}\pi(z \mid \theta) = c\sum_{z \in A^n}\prod_{t \le n}e^{-(z_t - y_t)^2/2\sigma^2}\pi(z \mid \theta), \]

where $x(z) \in \{0, 1\}^{nL}$ such that its $((t-1)L + a)$-th entry is $1\{z_t = a\}$. Denote

\[ p(z) = \big(e^{x(z)^T(u - \theta)} - c\big)\prod_{t \le n}e^{-z_t^2/2\sigma^2}\pi(z \mid \theta) \]

and make the change of variable $y_t/\sigma^2 \to y_t$. Then for all $y \in \mathbb{R}^n$, $\sum_{z \in A^n}\exp(y^Tz)p(z) = 0$. In other words, the Laplace transform of $p(z)$ is 0. Therefore, $p(z) = 0$. By assumption, $\pi(z \mid \theta) > 0$ for all $z \in A^n$. As a result, $x(z)^T(u - \theta) = \ln c$ for all $z \in A^n$. Let all $z_t = L + 1$ to get $x(z) = 0$ and hence $\ln c = 0$. Next, given $t \le n$ and $a \le L$, let $z_t = a$ while for $s \ne t$, let $z_s = L + 1$. This yields $u_{(t-1)L+a} - \theta_{(t-1)L+a} = 0$. Thus $u = \theta$.

Because $E\ell(u)$ is continuous in $u \in D$, with $D$ being bounded, to finish the proof, it remains to show that $H$ is positive definite. Suppose $v^THv = 0$ for some $v$. We need to show that $v = 0$. Since $v^THv = \mathrm{Var}(E(X_1^Tv \mid Y_1))$, there is a constant $c \in \mathbb{R}$, such that $E(X_1^Tv \mid Y_1 = y) = c$ for all $y \in \mathbb{R}^n$. Following the argument employed to show $E\ell(u) > E\ell(\theta)$ for $u \ne \theta$, it can be shown that $v = 0$. Thus the proof is complete. □

Proof of Lemma \5.4\ Part (1) is straightforward from Assumption [2] and the definition of Xi and 
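To prove (2), observe

\[ \varphi_i(s, t) = \frac{1}{g_i(s)}\int_0^1[g_i'(s + tu) - g_i'(s)]\,du. \tag{72} \]

Then

\[ |\varphi_i(s, t') - \varphi_i(s, t)| \le \frac{1}{A_g}\int_0^1|g_i'(s + t'u) - g_i'(s + tu)|\,du \le \frac{F_2}{A_g}\int_0^1|t' - t|u\,du = \frac{F_2|t' - t|}{2A_g}, \]

showing $\varphi_i(s, \cdot)$ is $(F_2/2A_g)$-Lipschitz. As a result, $\varphi_i(s, \cdot)X_{ih}$ is $\psi_2$-Lipschitz. Let $t' = 0$ and $t = X_i^T(u - \theta)$. Then $\varphi_i(s, t') = 0$, so by $|t| \le M_XR_D$, $|\varphi_i(s, t)| \le F_2M_XR_D/(2A_g) = \psi_2R_D$. On the other hand, (72) implies $|\varphi_i(s, t)| \le 2F_1/A_g$. Therefore, $|\varphi_i(s, t)| \le \psi_3$. □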

Proof of Lemma \5.6\ Given (s,t), (s',t') £ J, let d s = s' — s, dt = t' — t. By Taylor expansion, for 
some 9 £ (0, 1), g(s' ,t') - g(s,t) = dig(s + 9d s ,t + 9d t )d s + d 2 g{s + 8,t + 8d t )d t . Since dig{s,t) = g(t) 
and d-2g{s,t) = sg'(t), then by Lemma [5~4l 



\g(s',t')-g(s,t)\ < qo\s' -s\+fo 6l \t'- t\. 
Therefore, h is (V'6/2,^oo)-Lipschitz. On the other hand, since g(0, t) = 0, for some 9 £ (0, 1), 

\g(s,t)\ = \g(s,t)-g(0,t)\ = \d l9 {6 S> t)\\s\ < g \s\. 
Similarly, since g (s, 0) = 0, \g(s,t)\ < ip 3 gi\t\. As a result, \g(s,t)\ < (ipe/2)mm(\s\,\t\), □ 